Learning Latent Trees with Stochastic Perturbations and Differentiable Dynamic Programming

Caio Corro; Ivan Titov

arxiv: 1906.09992 · v1 · pith:K3MUG64Ynew · submitted 2019-06-24 · 💻 cs.CL · cs.LG

Learning Latent Trees with Stochastic Perturbations and Differentiable Dynamic Programming

Caio Corro , Ivan Titov This is my paper

Pith reviewed 2026-05-25 17:31 UTC · model grok-4.3

classification 💻 cs.CL cs.LG

keywords latent tree learningGumbel perturbationsdifferentiable dynamic programmingprojective dependency treessentiment analysisnatural language inferencestructure induction

0 comments

The pith

Projective dependency trees can be induced as latent variables to benefit NLP tasks using Gumbel perturbations and differentiable dynamic programming without any tree supervision.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a model that treats projective dependency trees as latent variables to support downstream tasks such as sentiment analysis and natural language inference. It induces useful trees without direct supervision by combining Gumbel perturbations for stochastic global sampling with differentiable dynamic programming, resulting in a fully differentiable parser. This differs from earlier latent tree approaches that lack global stochastic sampling. Experiments demonstrate the method's effectiveness, while ablations confirm that both stochasticity and the projective constraint matter for the observed gains.

Core claim

We treat projective dependency trees as latent variables in our probabilistic model and induce them in such a way as to be beneficial for a downstream task, without relying on any direct tree supervision. Our approach relies on Gumbel perturbations and differentiable dynamic programming. Unlike previous approaches to latent tree learning, we stochastically sample global structures and our parser is fully differentiable. We illustrate its effectiveness on sentiment analysis and natural language inference tasks. We also study its properties on a synthetic structure induction task. Ablation studies emphasize the importance of both stochasticity and constraining latent structures to be projectve

What carries the argument

Gumbel perturbations combined with differentiable dynamic programming, which permits stochastic sampling of global projective dependency tree structures as latent variables.

If this is right

The approach shows effectiveness on sentiment analysis and natural language inference tasks.
Properties of the method are studied via a synthetic structure induction task.
Stochasticity in global structure sampling contributes to the results.
Constraining the latent structures to projective trees is important according to the ablations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The fully differentiable nature may allow seamless combination with other neural modules in larger pipelines.
Similar perturbation techniques could be tested on non-projective structures or different structured prediction problems.
Task-specific trees learned this way may differ systematically from those produced by standard supervised parsers.

Load-bearing premise

That constraining latent structures to projective trees and using stochastic global sampling will produce trees beneficial for the downstream task.

What would settle it

If the full model shows no performance gain on sentiment analysis or natural language inference compared with versions that remove stochastic sampling or the projectivity constraint.

Figures

Figures reproduced from arXiv: 1906.09992 by Caio Corro, Ivan Titov.

**Figure 2.** Figure 2: An example from the ListOps dataset. Num [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗

**Figure 3.** Figure 3: Examples of trees induced on the matched development set of MultiNLI, the model using 2 GCN layers. [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

**Figure 4.** Figure 4: (a) Single output of an arg max function. The derivative is null almost everywhere, i.e. there is no descent direction. (b) Single output of the differentiable relaxation. The derivatives are non-null. static graph. Instead of relying on masking, we add an input the DDP node that contains the sentence size : therefore, even if the size of the graph is fixed, the cubic-time algorithm is run on the true inp… view at source ↗

**Figure 5.** Figure 5: Accuracy of tagging and attachment score of the latent tree during training. [PITH_FULL_IMAGE:figures/full_fig_p014_5.png] view at source ↗

read the original abstract

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives a workable way to sample full projective trees stochastically inside an end-to-end differentiable parser using Gumbel perturbations and DP, but the experiments leave open whether the trees themselves drive the gains or just add useful noise.

read the letter

The core contribution is the combination of Gumbel perturbations with differentiable dynamic programming so that a model can draw complete projective dependency trees as latent variables and still back-propagate through the sampling step. That is a concrete algorithmic step beyond earlier latent-tree work that either used local decisions or non-differentiable global inference. They evaluate on sentiment analysis and natural language inference plus a synthetic structure-induction task, and the ablations show clear drops when either the stochastic sampling or the projectivity constraint is removed. That part is cleanly executed and worth having in the literature for people who want global latent structures without breaking differentiability. The main uncertainty is whether the induced trees are capturing task-relevant dependencies or whether the performance lift comes from the regularization and inductive bias that the perturbations and projectivity constraint introduce. The synthetic task lets them study properties, but the abstract and reported results do not include direct checks such as tree inspection against gold structures or controlled comparisons that would separate structure from regularization. Without those, the claim that the trees are beneficial rests on the performance numbers and ablations alone. This paper is aimed at researchers working on latent-variable models for structured prediction in NLP. Anyone already building parsers or tree induction modules will find the technical details useful and the experiments straightforward to replicate. It is solid enough on the algorithmic side to merit a full referee process even if the authors are asked to add more analysis of the actual trees.

Referee Report

2 major / 1 minor

Summary. The paper proposes a method to treat projective dependency trees as latent variables in a probabilistic model, inducing them via Gumbel perturbations and differentiable dynamic programming to benefit downstream tasks without direct tree supervision. Unlike prior latent tree approaches, it enables stochastic sampling of global structures with a fully differentiable parser. Effectiveness is shown on sentiment analysis and natural language inference, with properties studied on a synthetic structure induction task; ablations highlight the roles of stochasticity and the projective constraint.

Significance. If the results hold, this advances latent structure learning by enabling fully differentiable stochastic global sampling of projective trees, addressing limitations of local or non-differentiable prior methods. The combination of Gumbel perturbations with differentiable DP for global structures is a clear technical strength, and the synthetic task provides an opportunity for controlled analysis of induction properties.

major comments (2)

[Ablation studies and synthetic task section] The central claim that the induced trees are beneficial for the downstream task (rather than the method providing regularization or inductive bias) rests on ablations showing performance drops when stochasticity or projectivity is removed. However, these ablations do not include direct validation that the sampled trees capture task-relevant dependencies (e.g., no tree inspection, comparison to gold structures, or recovery metrics on the synthetic task).
[Evaluation sections] The abstract and evaluation sections claim effectiveness on SA and NLI tasks, but without reported quantitative results, error bars, dataset details, or baseline comparisons in the provided text, the strength of the empirical support for the method's superiority remains difficult to assess.

minor comments (1)

The abstract would be strengthened by including at least one key quantitative result (with error bars) from the main experiments to substantiate the effectiveness claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments. We address each major comment below, providing clarifications and indicating where revisions will be made.

read point-by-point responses

Referee: [Ablation studies and synthetic task section] The central claim that the induced trees are beneficial for the downstream task (rather than the method providing regularization or inductive bias) rests on ablations showing performance drops when stochasticity or projectivity is removed. However, these ablations do not include direct validation that the sampled trees capture task-relevant dependencies (e.g., no tree inspection, comparison to gold structures, or recovery metrics on the synthetic task).

Authors: We agree that additional direct validation of the induced trees, such as example inspections or recovery metrics on the synthetic task, would strengthen the interpretation that performance gains arise from task-relevant structures rather than generic regularization. The synthetic task is intended to enable controlled study of induction properties under known conditions, and the ablations isolate the contributions of stochastic global sampling and the projective constraint. We will incorporate further analysis of the induced structures in the revised manuscript. revision: partial
Referee: [Evaluation sections] The abstract and evaluation sections claim effectiveness on SA and NLI tasks, but without reported quantitative results, error bars, dataset details, or baseline comparisons in the provided text, the strength of the empirical support for the method's superiority remains difficult to assess.

Authors: The full manuscript reports quantitative results with error bars, dataset details, and baseline comparisons in the evaluation sections for both sentiment analysis and natural language inference. The abstract summarizes the findings at a high level without numerical values, following standard conventions. We will ensure these empirical details are clearly highlighted and cross-referenced in the revised version. revision: no

Circularity Check

0 steps flagged

No circularity in derivation or claims

full rationale

The paper introduces an algorithmic method for latent projective tree induction via Gumbel perturbations and differentiable dynamic programming, evaluated empirically on SA/NLI tasks and a synthetic task. No equations, self-citations, or derivations reduce any claimed result to a fitted input or self-definition by construction. Ablations are presented as supporting evidence for component importance but do not constitute a circular reduction. This matches the expected non-circular outcome for a methodological contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach rests on standard assumptions from dependency parsing and differentiable optimization; no new free parameters or invented entities are introduced in the abstract.

axioms (1)

domain assumption Projective dependency trees form a useful and sufficient class of latent structures for the target tasks.
The paper explicitly constrains structures to be projective and reports ablations showing the constraint matters.

pith-pipeline@v0.9.0 · 5616 in / 1175 out tokens · 27994 ms · 2026-05-25T17:31:50.070561+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages · 3 internal anchors

[1]

Joost Bastings, Ivan Titov, Wilker Aziz, Diego Marcheggiani, and Khalil Simaan. 2017. http://www.aclweb.org/anthology/D17-1208 Graph convolutional encoders for syntax-aware neural machine translation . In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1947--1957. Association for Computational Linguistics

work page 2017
[2]

Yoshua Bengio. 2013. Estimating or propagating gradients through stochastic neurons. arXiv preprint arXiv:1305.2982

work page internal anchor Pith review Pith/arXiv arXiv 2013
[3]

Bowman, Gabor Angeli, Christopher Potts, and Christopher D

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. https://doi.org/10.18653/v1/D15-1075 A large annotated corpus for learning natural language inference . In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632--642. Association for Computational Linguistics

work page doi:10.18653/v1/d15-1075 2015
[4]

Stephen Boyd and Lieven Vandenberghe. 2004. Convex optimization. Cambridge university press

work page 2004
[5]

Jihun Choi, Kang Min Yoo, and Sang-goo Lee. 2018. Learning to compose task-specific tree structures. In Proceedings of the 2018 Association for the Advancement of Artificial Intelligence (AAAI). and the 7th International Joint Conference on Natural Language Processing (ACL-IJCNLP)

work page 2018
[6]

Caio Corro and Ivan Titov. 2019. Differentiable perturb-and-parse: Semi-supervised parsing with a structured variational autoencoder. In Proceedings of the International Conference on Learning Representations

work page 2019
[7]

Hang Cui, Renxu Sun, Keya Li, Min-Yen Kan, and Tat-Seng Chua. 2005. Question answering passage retrieval using dependency relations. In Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, pages 400--407. ACM

work page 2005
[8]

D Den Hertog, Cornelis Roos, and Tam \'a s Terlaky. 1994. Inverse barrier methods for linear programming. RAIRO-Operations Research, 28(2):135--163

work page 1994
[9]

Jason M. Eisner. 1996. http://www.aclweb.org/anthology/C96-1058 Three new probabilistic models for dependency parsing: An exploration . In COLING 1996 Volume 1: The 16th International Conference on Computational Linguistics

work page 1996
[10]

Jennifer Foster, \"O zlem C etinoglu, Joachim Wagner, Joseph Le Roux, Stephen Hogan, Joakim Nivre, Deirdre Hogan, and Josef Van Genabith. 2011. \# hardtoparse: Pos tagging and parsing the twitterverse. In AAAI 2011 workshop on analyzing microtext, pages 20--25

work page 2011
[11]

Michael C Fu. 2006. Gradient estimation. Handbooks in operations research and management science, 13:575--616

work page 2006
[12]

Stephen Gould, Basura Fernando, Anoop Cherian, Peter Anderson, Rodrigo Santa Cruz, and Edison Guo. 2016. On differentiating parameterized argmin and argmax problems with application to bi-level optimization. arXiv preprint arXiv:1607.05447

work page internal anchor Pith review Pith/arXiv arXiv 2016
[13]

Phu Mon Htut, Kyunghyun Cho, and Samuel Bowman. 2018. http://aclweb.org/anthology/D18-1544 Grammar induction with neural language models: An unusual replication . In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4998--5003. Association for Computational Linguistics

work page 2018
[14]

Eric Jang, Shixiang Gu, and Ben Poole. 2017. Categorical reparameterization with gumbel-softmax. In Proceedings of the 2017 International Conference on Learning Representations

work page 2017
[15]

Yoon Kim, Carl Denton, Luong Hoang, and Alexander M Rush. 2017. Structured attention networks. In Proceedings of the International Conference on Learning Representations

work page 2017
[16]

Diederik P Kingma and Max Welling. 2014. Auto-encoding variational bayes. In Proceedings of the International Conference on Learning Representations

work page 2014
[17]

Thomas N Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. In Proceedings of the International Conference on Learning Representations

work page 2017
[18]

Dan Klein and Christopher D Manning. 2005. The unsupervised learning of natural language structure. Stanford University Stanford, CA

work page 2005
[19]

John Lafferty, Andrew McCallum, and Fernando CN Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the 18th International Conference on Machine Learning

work page 2001
[20]

Yang Liu and Mirella Lapata. 2018. http://aclweb.org/anthology/Q18-1005 Learning structured text representations . Transactions of the Association for Computational Linguistics, 6:63--75

work page 2018
[21]

Yang Liu, Furu Wei, Sujian Li, Heng Ji, Ming Zhou, and Houfeng WANG. 2015. https://doi.org/10.3115/v1/P15-2047 A dependency-based neural network for relation classification . In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short ...

work page doi:10.3115/v1/p15-2047 2015
[22]

Chris J Maddison, Daniel Tarlow, and Tom Minka. 2014. A* sampling. In Advances in Neural Information Processing Systems, pages 3086--3094

work page 2014
[23]

Jean Maillard, Stephen Clark, and Dani Yogatama. 2017. Jointly learning sentence embeddings and syntax with unsupervised tree- LSTM s. arXiv preprint arXiv:1705.09189

work page arXiv 2017
[24]

Christopher Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard, and David McClosky. 2014. https://doi.org/10.3115/v1/P14-5010 The stanford corenlp natural language processing toolkit . In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55--60. Association for Computational Li...

work page doi:10.3115/v1/p14-5010 2014
[25]

Diego Marcheggiani and Ivan Titov. 2017. http://www.aclweb.org/anthology/D17-1159 Encoding sentences with graph convolutional networks for semantic role labeling . In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1507--1516. Association for Computational Linguistics

work page 2017
[26]

Arthur Mensch and Mathieu Blondel. 2018. Differentiable dynamic programming for structured prediction and attention. In Proceedings of the 35th International Conference on Machine Learning

work page 2018
[27]

Nikita Nangia and Samuel Bowman. 2018. https://doi.org/10.18653/v1/N18-4013 Listops: A diagnostic dataset for latent tree learning . In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop, pages 92--99. Association for Computational Linguistics

work page doi:10.18653/v1/n18-4013 2018
[28]

Jason Naradowsky, Sebastian Riedel, and David Smith. 2012. https://www.aclweb.org/anthology/D12-1074 Improving NLP through marginalization of hidden syntactic structure . In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 810--820, Jeju Island, Korea. Associati...

work page 2012
[29]

Graham Neubig, Chris Dyer, Yoav Goldberg, Austin Matthews, Waleed Ammar, Antonios Anastasopoulos, Miguel Ballesteros, David Chiang, Daniel Clothiaux, Trevor Cohn, Kevin Duh, Manaal Faruqui, Cynthia Gan, Dan Garrette, Yangfeng Ji, Lingpeng Kong, Adhiguna Kuncoro, Gaurav Kumar, Chaitanya Malaviya, Paul Michel, Yusuke Oda, Matthew Richardson, Naomi Saphra, S...

work page internal anchor Pith review Pith/arXiv arXiv 2017
[30]

Vlad Niculae, Andr \'e F. T. Martins, and Claire Cardie. 2018. http://aclweb.org/anthology/D18-1108 Towards dynamic computation graphs via sparse latent structure . In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 905--911. Association for Computational Linguistics

work page 2018
[31]

George Papandreou and Alan L Yuille. 2011. Perturb-and- MAP random fields: Using discrete optimization to learn and sample from energy models. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 193--200. IEEE

work page 2011
[32]

a ckstr \

Ankur Parikh, Oscar T \"a ckstr \"o m, Dipanjan Das, and Jakob Uszkoreit. 2016. https://doi.org/10.18653/v1/D16-1244 A decomposable attention model for natural language inference . In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2249--2255. Association for Computational Linguistics

work page doi:10.18653/v1/d16-1244 2016
[33]

Hao Peng, Sam Thomson, and Noah A. Smith. 2018. http://aclweb.org/anthology/P18-1173 Backpropagating through structured argmax using a spigot . In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1863--1873. Association for Computational Linguistics

work page 2018
[34]

Slav Petrov, Pi-Chuan Chang, Michael Ringgaard, and Hiyan Alshawi. 2010. http://aclweb.org/anthology/D10-1069 Uptraining for accurate deterministic question parsing . In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 705--713. Association for Computational Linguistics

work page 2010
[35]

Florian A Potra and Stephen J Wright. 2000. Interior-point methods. Journal of Computational and Applied Mathematics, 124(1-2):281--302

work page 2000
[36]

Anand Rangarajan. 2000. Self-annealing and self-annihilation: unifying deterministic annealing and relaxation labeling. Pattern Recognition, 33(4):635--649

work page 2000
[37]

John Schulman, Nicolas Heess, Theophane Weber, and Pieter Abbeel. 2015. Gradient estimation using stochastic computation graphs. In Advances in Neural Information Processing Systems, pages 3528--3536

work page 2015
[38]

Yikang Shen, Zhouhan Lin, Chin wei Huang, and Aaron Courville. 2018. https://openreview.net/forum?id=rkgOLb-0W Neural language modeling by jointly learning syntax and lexicon . In International Conference on Learning Representations

work page 2018
[39]

Manning, Andrew Ng, and Christopher Potts

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. http://www.aclweb.org/anthology/D13-1170 Recursive deep models for semantic compositionality over a sentiment treebank . In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631--1642. Associat...

work page 2013
[40]

Xinyi Wang, Hieu Pham, Pengcheng Yin, and Graham Neubig. 2018. http://aclweb.org/anthology/D18-1509 A tree-based decoder for neural machine translation . In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4772--4777. Association for Computational Linguistics

work page 2018
[41]

Adina Williams, Andrew Drozdov, and Samuel R. Bowman. 2018 a . http://aclweb.org/anthology/Q18-1019 Do latent tree learning models identify meaningful structure in sentences? Transactions of the Association for Computational Linguistics, 6:253--267

work page 2018
[42]

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018 b . http://aclweb.org/anthology/N18-1101 A broad-coverage challenge corpus for sentence understanding through inference . In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 111...

work page 2018
[43]

R Williams. 1987. A class of gradient-estimation algorithms for reinforcement learning in neural networks. In Proceedings of the International Conference on Neural Networks, pages II--601

work page 1987
[44]

Dani Yogatama, Phil Blunsom, Chris Dyer, Edward Grefenstette, and Wang Ling. 2017. Learning to compose words into sentences with reinforcement learning. In Proceedings of the International Conference on Learning Representations

work page 2017
[45]

URL: " 'urlintro :=

ENTRY address author booktitle chapter edition editor howpublished institution journal key month note number organization pages publisher school series title type volume year eprint doi pubmed url lastchecked label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block STRINGS urlintro eprinturl eprintpr...

work page
[46]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...

work page

[1] [1]

Joost Bastings, Ivan Titov, Wilker Aziz, Diego Marcheggiani, and Khalil Simaan. 2017. http://www.aclweb.org/anthology/D17-1208 Graph convolutional encoders for syntax-aware neural machine translation . In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1947--1957. Association for Computational Linguistics

work page 2017

[2] [2]

Yoshua Bengio. 2013. Estimating or propagating gradients through stochastic neurons. arXiv preprint arXiv:1305.2982

work page internal anchor Pith review Pith/arXiv arXiv 2013

[3] [3]

Bowman, Gabor Angeli, Christopher Potts, and Christopher D

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. https://doi.org/10.18653/v1/D15-1075 A large annotated corpus for learning natural language inference . In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632--642. Association for Computational Linguistics

work page doi:10.18653/v1/d15-1075 2015

[4] [4]

Stephen Boyd and Lieven Vandenberghe. 2004. Convex optimization. Cambridge university press

work page 2004

[5] [5]

Jihun Choi, Kang Min Yoo, and Sang-goo Lee. 2018. Learning to compose task-specific tree structures. In Proceedings of the 2018 Association for the Advancement of Artificial Intelligence (AAAI). and the 7th International Joint Conference on Natural Language Processing (ACL-IJCNLP)

work page 2018

[6] [6]

Caio Corro and Ivan Titov. 2019. Differentiable perturb-and-parse: Semi-supervised parsing with a structured variational autoencoder. In Proceedings of the International Conference on Learning Representations

work page 2019

[7] [7]

Hang Cui, Renxu Sun, Keya Li, Min-Yen Kan, and Tat-Seng Chua. 2005. Question answering passage retrieval using dependency relations. In Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, pages 400--407. ACM

work page 2005

[8] [8]

D Den Hertog, Cornelis Roos, and Tam \'a s Terlaky. 1994. Inverse barrier methods for linear programming. RAIRO-Operations Research, 28(2):135--163

work page 1994

[9] [9]

Jason M. Eisner. 1996. http://www.aclweb.org/anthology/C96-1058 Three new probabilistic models for dependency parsing: An exploration . In COLING 1996 Volume 1: The 16th International Conference on Computational Linguistics

work page 1996

[10] [10]

Jennifer Foster, \"O zlem C etinoglu, Joachim Wagner, Joseph Le Roux, Stephen Hogan, Joakim Nivre, Deirdre Hogan, and Josef Van Genabith. 2011. \# hardtoparse: Pos tagging and parsing the twitterverse. In AAAI 2011 workshop on analyzing microtext, pages 20--25

work page 2011

[11] [11]

Michael C Fu. 2006. Gradient estimation. Handbooks in operations research and management science, 13:575--616

work page 2006

[12] [12]

Stephen Gould, Basura Fernando, Anoop Cherian, Peter Anderson, Rodrigo Santa Cruz, and Edison Guo. 2016. On differentiating parameterized argmin and argmax problems with application to bi-level optimization. arXiv preprint arXiv:1607.05447

work page internal anchor Pith review Pith/arXiv arXiv 2016

[13] [13]

Phu Mon Htut, Kyunghyun Cho, and Samuel Bowman. 2018. http://aclweb.org/anthology/D18-1544 Grammar induction with neural language models: An unusual replication . In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4998--5003. Association for Computational Linguistics

work page 2018

[14] [14]

Eric Jang, Shixiang Gu, and Ben Poole. 2017. Categorical reparameterization with gumbel-softmax. In Proceedings of the 2017 International Conference on Learning Representations

work page 2017

[15] [15]

Yoon Kim, Carl Denton, Luong Hoang, and Alexander M Rush. 2017. Structured attention networks. In Proceedings of the International Conference on Learning Representations

work page 2017

[16] [16]

Diederik P Kingma and Max Welling. 2014. Auto-encoding variational bayes. In Proceedings of the International Conference on Learning Representations

work page 2014

[17] [17]

Thomas N Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. In Proceedings of the International Conference on Learning Representations

work page 2017

[18] [18]

Dan Klein and Christopher D Manning. 2005. The unsupervised learning of natural language structure. Stanford University Stanford, CA

work page 2005

[19] [19]

John Lafferty, Andrew McCallum, and Fernando CN Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the 18th International Conference on Machine Learning

work page 2001

[20] [20]

Yang Liu and Mirella Lapata. 2018. http://aclweb.org/anthology/Q18-1005 Learning structured text representations . Transactions of the Association for Computational Linguistics, 6:63--75

work page 2018

[21] [21]

Yang Liu, Furu Wei, Sujian Li, Heng Ji, Ming Zhou, and Houfeng WANG. 2015. https://doi.org/10.3115/v1/P15-2047 A dependency-based neural network for relation classification . In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short ...

work page doi:10.3115/v1/p15-2047 2015

[22] [22]

Chris J Maddison, Daniel Tarlow, and Tom Minka. 2014. A* sampling. In Advances in Neural Information Processing Systems, pages 3086--3094

work page 2014

[23] [23]

Jean Maillard, Stephen Clark, and Dani Yogatama. 2017. Jointly learning sentence embeddings and syntax with unsupervised tree- LSTM s. arXiv preprint arXiv:1705.09189

work page arXiv 2017

[24] [24]

Christopher Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard, and David McClosky. 2014. https://doi.org/10.3115/v1/P14-5010 The stanford corenlp natural language processing toolkit . In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55--60. Association for Computational Li...

work page doi:10.3115/v1/p14-5010 2014

[25] [25]

Diego Marcheggiani and Ivan Titov. 2017. http://www.aclweb.org/anthology/D17-1159 Encoding sentences with graph convolutional networks for semantic role labeling . In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1507--1516. Association for Computational Linguistics

work page 2017

[26] [26]

Arthur Mensch and Mathieu Blondel. 2018. Differentiable dynamic programming for structured prediction and attention. In Proceedings of the 35th International Conference on Machine Learning

work page 2018

[27] [27]

Nikita Nangia and Samuel Bowman. 2018. https://doi.org/10.18653/v1/N18-4013 Listops: A diagnostic dataset for latent tree learning . In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop, pages 92--99. Association for Computational Linguistics

work page doi:10.18653/v1/n18-4013 2018

[28] [28]

Jason Naradowsky, Sebastian Riedel, and David Smith. 2012. https://www.aclweb.org/anthology/D12-1074 Improving NLP through marginalization of hidden syntactic structure . In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 810--820, Jeju Island, Korea. Associati...

work page 2012

[29] [29]

Graham Neubig, Chris Dyer, Yoav Goldberg, Austin Matthews, Waleed Ammar, Antonios Anastasopoulos, Miguel Ballesteros, David Chiang, Daniel Clothiaux, Trevor Cohn, Kevin Duh, Manaal Faruqui, Cynthia Gan, Dan Garrette, Yangfeng Ji, Lingpeng Kong, Adhiguna Kuncoro, Gaurav Kumar, Chaitanya Malaviya, Paul Michel, Yusuke Oda, Matthew Richardson, Naomi Saphra, S...

work page internal anchor Pith review Pith/arXiv arXiv 2017

[30] [30]

Vlad Niculae, Andr \'e F. T. Martins, and Claire Cardie. 2018. http://aclweb.org/anthology/D18-1108 Towards dynamic computation graphs via sparse latent structure . In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 905--911. Association for Computational Linguistics

work page 2018

[31] [31]

George Papandreou and Alan L Yuille. 2011. Perturb-and- MAP random fields: Using discrete optimization to learn and sample from energy models. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 193--200. IEEE

work page 2011

[32] [32]

a ckstr \

Ankur Parikh, Oscar T \"a ckstr \"o m, Dipanjan Das, and Jakob Uszkoreit. 2016. https://doi.org/10.18653/v1/D16-1244 A decomposable attention model for natural language inference . In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2249--2255. Association for Computational Linguistics

work page doi:10.18653/v1/d16-1244 2016

[33] [33]

Hao Peng, Sam Thomson, and Noah A. Smith. 2018. http://aclweb.org/anthology/P18-1173 Backpropagating through structured argmax using a spigot . In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1863--1873. Association for Computational Linguistics

work page 2018

[34] [34]

Slav Petrov, Pi-Chuan Chang, Michael Ringgaard, and Hiyan Alshawi. 2010. http://aclweb.org/anthology/D10-1069 Uptraining for accurate deterministic question parsing . In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 705--713. Association for Computational Linguistics

work page 2010

[35] [35]

Florian A Potra and Stephen J Wright. 2000. Interior-point methods. Journal of Computational and Applied Mathematics, 124(1-2):281--302

work page 2000

[36] [36]

Anand Rangarajan. 2000. Self-annealing and self-annihilation: unifying deterministic annealing and relaxation labeling. Pattern Recognition, 33(4):635--649

work page 2000

[37] [37]

John Schulman, Nicolas Heess, Theophane Weber, and Pieter Abbeel. 2015. Gradient estimation using stochastic computation graphs. In Advances in Neural Information Processing Systems, pages 3528--3536

work page 2015

[38] [38]

Yikang Shen, Zhouhan Lin, Chin wei Huang, and Aaron Courville. 2018. https://openreview.net/forum?id=rkgOLb-0W Neural language modeling by jointly learning syntax and lexicon . In International Conference on Learning Representations

work page 2018

[39] [39]

Manning, Andrew Ng, and Christopher Potts

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. http://www.aclweb.org/anthology/D13-1170 Recursive deep models for semantic compositionality over a sentiment treebank . In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631--1642. Associat...

work page 2013

[40] [40]

Xinyi Wang, Hieu Pham, Pengcheng Yin, and Graham Neubig. 2018. http://aclweb.org/anthology/D18-1509 A tree-based decoder for neural machine translation . In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4772--4777. Association for Computational Linguistics

work page 2018

[41] [41]

Adina Williams, Andrew Drozdov, and Samuel R. Bowman. 2018 a . http://aclweb.org/anthology/Q18-1019 Do latent tree learning models identify meaningful structure in sentences? Transactions of the Association for Computational Linguistics, 6:253--267

work page 2018

[42] [42]

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018 b . http://aclweb.org/anthology/N18-1101 A broad-coverage challenge corpus for sentence understanding through inference . In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 111...

work page 2018

[43] [43]

R Williams. 1987. A class of gradient-estimation algorithms for reinforcement learning in neural networks. In Proceedings of the International Conference on Neural Networks, pages II--601

work page 1987

[44] [44]

Dani Yogatama, Phil Blunsom, Chris Dyer, Edward Grefenstette, and Wang Ling. 2017. Learning to compose words into sentences with reinforcement learning. In Proceedings of the International Conference on Learning Representations

work page 2017

[45] [45]

URL: " 'urlintro :=

ENTRY address author booktitle chapter edition editor howpublished institution journal key month note number organization pages publisher school series title type volume year eprint doi pubmed url lastchecked label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block STRINGS urlintro eprinturl eprintpr...

work page

[46] [46]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...

work page