pith. sign in

arxiv: 1906.09992 · v1 · pith:K3MUG64Ynew · submitted 2019-06-24 · 💻 cs.CL · cs.LG

Learning Latent Trees with Stochastic Perturbations and Differentiable Dynamic Programming

Pith reviewed 2026-05-25 17:31 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords latent tree learningGumbel perturbationsdifferentiable dynamic programmingprojective dependency treessentiment analysisnatural language inferencestructure induction
0
0 comments X

The pith

Projective dependency trees can be induced as latent variables to benefit NLP tasks using Gumbel perturbations and differentiable dynamic programming without any tree supervision.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a model that treats projective dependency trees as latent variables to support downstream tasks such as sentiment analysis and natural language inference. It induces useful trees without direct supervision by combining Gumbel perturbations for stochastic global sampling with differentiable dynamic programming, resulting in a fully differentiable parser. This differs from earlier latent tree approaches that lack global stochastic sampling. Experiments demonstrate the method's effectiveness, while ablations confirm that both stochasticity and the projective constraint matter for the observed gains.

Core claim

We treat projective dependency trees as latent variables in our probabilistic model and induce them in such a way as to be beneficial for a downstream task, without relying on any direct tree supervision. Our approach relies on Gumbel perturbations and differentiable dynamic programming. Unlike previous approaches to latent tree learning, we stochastically sample global structures and our parser is fully differentiable. We illustrate its effectiveness on sentiment analysis and natural language inference tasks. We also study its properties on a synthetic structure induction task. Ablation studies emphasize the importance of both stochasticity and constraining latent structures to be projectve

What carries the argument

Gumbel perturbations combined with differentiable dynamic programming, which permits stochastic sampling of global projective dependency tree structures as latent variables.

If this is right

  • The approach shows effectiveness on sentiment analysis and natural language inference tasks.
  • Properties of the method are studied via a synthetic structure induction task.
  • Stochasticity in global structure sampling contributes to the results.
  • Constraining the latent structures to projective trees is important according to the ablations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The fully differentiable nature may allow seamless combination with other neural modules in larger pipelines.
  • Similar perturbation techniques could be tested on non-projective structures or different structured prediction problems.
  • Task-specific trees learned this way may differ systematically from those produced by standard supervised parsers.

Load-bearing premise

That constraining latent structures to projective trees and using stochastic global sampling will produce trees beneficial for the downstream task.

What would settle it

If the full model shows no performance gain on sentiment analysis or natural language inference compared with versions that remove stochastic sampling or the projectivity constraint.

Figures

Figures reproduced from arXiv: 1906.09992 by Caio Corro, Ivan Titov.

Figure 1
Figure 1. Figure 1: The two directed graphical models used in [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: An example from the ListOps dataset. Num [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Examples of trees induced on the matched development set of MultiNLI, the model using 2 GCN layers. [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: (a) Single output of an arg max function. The derivative is null almost everywhere, i.e. there is no descent direction. (b) Single output of the differentiable relaxation. The derivatives are non-null. static graph. Instead of relying on masking, we add an input the DDP node that contains the sen￾tence size : therefore, even if the size of the graph is fixed, the cubic-time algorithm is run on the true inp… view at source ↗
Figure 5
Figure 5. Figure 5: Accuracy of tagging and attachment score of the latent tree during training. [PITH_FULL_IMAGE:figures/full_fig_p014_5.png] view at source ↗
read the original abstract

We treat projective dependency trees as latent variables in our probabilistic model and induce them in such a way as to be beneficial for a downstream task, without relying on any direct tree supervision. Our approach relies on Gumbel perturbations and differentiable dynamic programming. Unlike previous approaches to latent tree learning, we stochastically sample global structures and our parser is fully differentiable. We illustrate its effectiveness on sentiment analysis and natural language inference tasks. We also study its properties on a synthetic structure induction task. Ablation studies emphasize the importance of both stochasticity and constraining latent structures to be projective trees.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes a method to treat projective dependency trees as latent variables in a probabilistic model, inducing them via Gumbel perturbations and differentiable dynamic programming to benefit downstream tasks without direct tree supervision. Unlike prior latent tree approaches, it enables stochastic sampling of global structures with a fully differentiable parser. Effectiveness is shown on sentiment analysis and natural language inference, with properties studied on a synthetic structure induction task; ablations highlight the roles of stochasticity and the projective constraint.

Significance. If the results hold, this advances latent structure learning by enabling fully differentiable stochastic global sampling of projective trees, addressing limitations of local or non-differentiable prior methods. The combination of Gumbel perturbations with differentiable DP for global structures is a clear technical strength, and the synthetic task provides an opportunity for controlled analysis of induction properties.

major comments (2)
  1. [Ablation studies and synthetic task section] The central claim that the induced trees are beneficial for the downstream task (rather than the method providing regularization or inductive bias) rests on ablations showing performance drops when stochasticity or projectivity is removed. However, these ablations do not include direct validation that the sampled trees capture task-relevant dependencies (e.g., no tree inspection, comparison to gold structures, or recovery metrics on the synthetic task).
  2. [Evaluation sections] The abstract and evaluation sections claim effectiveness on SA and NLI tasks, but without reported quantitative results, error bars, dataset details, or baseline comparisons in the provided text, the strength of the empirical support for the method's superiority remains difficult to assess.
minor comments (1)
  1. The abstract would be strengthened by including at least one key quantitative result (with error bars) from the main experiments to substantiate the effectiveness claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments. We address each major comment below, providing clarifications and indicating where revisions will be made.

read point-by-point responses
  1. Referee: [Ablation studies and synthetic task section] The central claim that the induced trees are beneficial for the downstream task (rather than the method providing regularization or inductive bias) rests on ablations showing performance drops when stochasticity or projectivity is removed. However, these ablations do not include direct validation that the sampled trees capture task-relevant dependencies (e.g., no tree inspection, comparison to gold structures, or recovery metrics on the synthetic task).

    Authors: We agree that additional direct validation of the induced trees, such as example inspections or recovery metrics on the synthetic task, would strengthen the interpretation that performance gains arise from task-relevant structures rather than generic regularization. The synthetic task is intended to enable controlled study of induction properties under known conditions, and the ablations isolate the contributions of stochastic global sampling and the projective constraint. We will incorporate further analysis of the induced structures in the revised manuscript. revision: partial

  2. Referee: [Evaluation sections] The abstract and evaluation sections claim effectiveness on SA and NLI tasks, but without reported quantitative results, error bars, dataset details, or baseline comparisons in the provided text, the strength of the empirical support for the method's superiority remains difficult to assess.

    Authors: The full manuscript reports quantitative results with error bars, dataset details, and baseline comparisons in the evaluation sections for both sentiment analysis and natural language inference. The abstract summarizes the findings at a high level without numerical values, following standard conventions. We will ensure these empirical details are clearly highlighted and cross-referenced in the revised version. revision: no

Circularity Check

0 steps flagged

No circularity in derivation or claims

full rationale

The paper introduces an algorithmic method for latent projective tree induction via Gumbel perturbations and differentiable dynamic programming, evaluated empirically on SA/NLI tasks and a synthetic task. No equations, self-citations, or derivations reduce any claimed result to a fitted input or self-definition by construction. Ablations are presented as supporting evidence for component importance but do not constitute a circular reduction. This matches the expected non-circular outcome for a methodological contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach rests on standard assumptions from dependency parsing and differentiable optimization; no new free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption Projective dependency trees form a useful and sufficient class of latent structures for the target tasks.
    The paper explicitly constrains structures to be projective and reports ablations showing the constraint matters.

pith-pipeline@v0.9.0 · 5616 in / 1175 out tokens · 27994 ms · 2026-05-25T17:31:50.070561+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages · 3 internal anchors

  1. [1]

    Joost Bastings, Ivan Titov, Wilker Aziz, Diego Marcheggiani, and Khalil Simaan. 2017. http://www.aclweb.org/anthology/D17-1208 Graph convolutional encoders for syntax-aware neural machine translation . In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1947--1957. Association for Computational Linguistics

  2. [2]

    Yoshua Bengio. 2013. Estimating or propagating gradients through stochastic neurons. arXiv preprint arXiv:1305.2982

  3. [3]

    Bowman, Gabor Angeli, Christopher Potts, and Christopher D

    Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. https://doi.org/10.18653/v1/D15-1075 A large annotated corpus for learning natural language inference . In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632--642. Association for Computational Linguistics

  4. [4]

    Stephen Boyd and Lieven Vandenberghe. 2004. Convex optimization. Cambridge university press

  5. [5]

    Jihun Choi, Kang Min Yoo, and Sang-goo Lee. 2018. Learning to compose task-specific tree structures. In Proceedings of the 2018 Association for the Advancement of Artificial Intelligence (AAAI). and the 7th International Joint Conference on Natural Language Processing (ACL-IJCNLP)

  6. [6]

    Caio Corro and Ivan Titov. 2019. Differentiable perturb-and-parse: Semi-supervised parsing with a structured variational autoencoder. In Proceedings of the International Conference on Learning Representations

  7. [7]

    Hang Cui, Renxu Sun, Keya Li, Min-Yen Kan, and Tat-Seng Chua. 2005. Question answering passage retrieval using dependency relations. In Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, pages 400--407. ACM

  8. [8]

    D Den Hertog, Cornelis Roos, and Tam \'a s Terlaky. 1994. Inverse barrier methods for linear programming. RAIRO-Operations Research, 28(2):135--163

  9. [9]

    Jason M. Eisner. 1996. http://www.aclweb.org/anthology/C96-1058 Three new probabilistic models for dependency parsing: An exploration . In COLING 1996 Volume 1: The 16th International Conference on Computational Linguistics

  10. [10]

    Jennifer Foster, \"O zlem C etinoglu, Joachim Wagner, Joseph Le Roux, Stephen Hogan, Joakim Nivre, Deirdre Hogan, and Josef Van Genabith. 2011. \# hardtoparse: Pos tagging and parsing the twitterverse. In AAAI 2011 workshop on analyzing microtext, pages 20--25

  11. [11]

    Michael C Fu. 2006. Gradient estimation. Handbooks in operations research and management science, 13:575--616

  12. [12]

    Stephen Gould, Basura Fernando, Anoop Cherian, Peter Anderson, Rodrigo Santa Cruz, and Edison Guo. 2016. On differentiating parameterized argmin and argmax problems with application to bi-level optimization. arXiv preprint arXiv:1607.05447

  13. [13]

    Phu Mon Htut, Kyunghyun Cho, and Samuel Bowman. 2018. http://aclweb.org/anthology/D18-1544 Grammar induction with neural language models: An unusual replication . In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4998--5003. Association for Computational Linguistics

  14. [14]

    Eric Jang, Shixiang Gu, and Ben Poole. 2017. Categorical reparameterization with gumbel-softmax. In Proceedings of the 2017 International Conference on Learning Representations

  15. [15]

    Yoon Kim, Carl Denton, Luong Hoang, and Alexander M Rush. 2017. Structured attention networks. In Proceedings of the International Conference on Learning Representations

  16. [16]

    Diederik P Kingma and Max Welling. 2014. Auto-encoding variational bayes. In Proceedings of the International Conference on Learning Representations

  17. [17]

    Thomas N Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. In Proceedings of the International Conference on Learning Representations

  18. [18]

    Dan Klein and Christopher D Manning. 2005. The unsupervised learning of natural language structure. Stanford University Stanford, CA

  19. [19]

    John Lafferty, Andrew McCallum, and Fernando CN Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the 18th International Conference on Machine Learning

  20. [20]

    Yang Liu and Mirella Lapata. 2018. http://aclweb.org/anthology/Q18-1005 Learning structured text representations . Transactions of the Association for Computational Linguistics, 6:63--75

  21. [21]

    Yang Liu, Furu Wei, Sujian Li, Heng Ji, Ming Zhou, and Houfeng WANG. 2015. https://doi.org/10.3115/v1/P15-2047 A dependency-based neural network for relation classification . In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short ...

  22. [22]

    Chris J Maddison, Daniel Tarlow, and Tom Minka. 2014. A* sampling. In Advances in Neural Information Processing Systems, pages 3086--3094

  23. [23]

    Jean Maillard, Stephen Clark, and Dani Yogatama. 2017. Jointly learning sentence embeddings and syntax with unsupervised tree- LSTM s. arXiv preprint arXiv:1705.09189

  24. [24]

    Christopher Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard, and David McClosky. 2014. https://doi.org/10.3115/v1/P14-5010 The stanford corenlp natural language processing toolkit . In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55--60. Association for Computational Li...

  25. [25]

    Diego Marcheggiani and Ivan Titov. 2017. http://www.aclweb.org/anthology/D17-1159 Encoding sentences with graph convolutional networks for semantic role labeling . In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1507--1516. Association for Computational Linguistics

  26. [26]

    Arthur Mensch and Mathieu Blondel. 2018. Differentiable dynamic programming for structured prediction and attention. In Proceedings of the 35th International Conference on Machine Learning

  27. [27]

    Nikita Nangia and Samuel Bowman. 2018. https://doi.org/10.18653/v1/N18-4013 Listops: A diagnostic dataset for latent tree learning . In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop, pages 92--99. Association for Computational Linguistics

  28. [28]

    Jason Naradowsky, Sebastian Riedel, and David Smith. 2012. https://www.aclweb.org/anthology/D12-1074 Improving NLP through marginalization of hidden syntactic structure . In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 810--820, Jeju Island, Korea. Associati...

  29. [29]

    Graham Neubig, Chris Dyer, Yoav Goldberg, Austin Matthews, Waleed Ammar, Antonios Anastasopoulos, Miguel Ballesteros, David Chiang, Daniel Clothiaux, Trevor Cohn, Kevin Duh, Manaal Faruqui, Cynthia Gan, Dan Garrette, Yangfeng Ji, Lingpeng Kong, Adhiguna Kuncoro, Gaurav Kumar, Chaitanya Malaviya, Paul Michel, Yusuke Oda, Matthew Richardson, Naomi Saphra, S...

  30. [30]

    Vlad Niculae, Andr \'e F. T. Martins, and Claire Cardie. 2018. http://aclweb.org/anthology/D18-1108 Towards dynamic computation graphs via sparse latent structure . In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 905--911. Association for Computational Linguistics

  31. [31]

    George Papandreou and Alan L Yuille. 2011. Perturb-and- MAP random fields: Using discrete optimization to learn and sample from energy models. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 193--200. IEEE

  32. [32]

    a ckstr \

    Ankur Parikh, Oscar T \"a ckstr \"o m, Dipanjan Das, and Jakob Uszkoreit. 2016. https://doi.org/10.18653/v1/D16-1244 A decomposable attention model for natural language inference . In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2249--2255. Association for Computational Linguistics

  33. [33]

    Hao Peng, Sam Thomson, and Noah A. Smith. 2018. http://aclweb.org/anthology/P18-1173 Backpropagating through structured argmax using a spigot . In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1863--1873. Association for Computational Linguistics

  34. [34]

    Slav Petrov, Pi-Chuan Chang, Michael Ringgaard, and Hiyan Alshawi. 2010. http://aclweb.org/anthology/D10-1069 Uptraining for accurate deterministic question parsing . In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 705--713. Association for Computational Linguistics

  35. [35]

    Florian A Potra and Stephen J Wright. 2000. Interior-point methods. Journal of Computational and Applied Mathematics, 124(1-2):281--302

  36. [36]

    Anand Rangarajan. 2000. Self-annealing and self-annihilation: unifying deterministic annealing and relaxation labeling. Pattern Recognition, 33(4):635--649

  37. [37]

    John Schulman, Nicolas Heess, Theophane Weber, and Pieter Abbeel. 2015. Gradient estimation using stochastic computation graphs. In Advances in Neural Information Processing Systems, pages 3528--3536

  38. [38]

    Yikang Shen, Zhouhan Lin, Chin wei Huang, and Aaron Courville. 2018. https://openreview.net/forum?id=rkgOLb-0W Neural language modeling by jointly learning syntax and lexicon . In International Conference on Learning Representations

  39. [39]

    Manning, Andrew Ng, and Christopher Potts

    Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. http://www.aclweb.org/anthology/D13-1170 Recursive deep models for semantic compositionality over a sentiment treebank . In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631--1642. Associat...

  40. [40]

    Xinyi Wang, Hieu Pham, Pengcheng Yin, and Graham Neubig. 2018. http://aclweb.org/anthology/D18-1509 A tree-based decoder for neural machine translation . In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4772--4777. Association for Computational Linguistics

  41. [41]

    Adina Williams, Andrew Drozdov, and Samuel R. Bowman. 2018 a . http://aclweb.org/anthology/Q18-1019 Do latent tree learning models identify meaningful structure in sentences? Transactions of the Association for Computational Linguistics, 6:253--267

  42. [42]

    Adina Williams, Nikita Nangia, and Samuel Bowman. 2018 b . http://aclweb.org/anthology/N18-1101 A broad-coverage challenge corpus for sentence understanding through inference . In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 111...

  43. [43]

    R Williams. 1987. A class of gradient-estimation algorithms for reinforcement learning in neural networks. In Proceedings of the International Conference on Neural Networks, pages II--601

  44. [44]

    Dani Yogatama, Phil Blunsom, Chris Dyer, Edward Grefenstette, and Wang Ling. 2017. Learning to compose words into sentences with reinforcement learning. In Proceedings of the International Conference on Learning Representations

  45. [45]

    URL: " 'urlintro :=

    ENTRY address author booktitle chapter edition editor howpublished institution journal key month note number organization pages publisher school series title type volume year eprint doi pubmed url lastchecked label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block STRINGS urlintro eprinturl eprintpr...

  46. [46]

    write newline

    " write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...