Inducing Syntactic Trees from BERT Representations

David Mare\v{c}ek; Rudolf Rosa

arxiv: 1906.11511 · v1 · pith:RAS3QWZXnew · submitted 2019-06-27 · 💻 cs.CL

Inducing Syntactic Trees from BERT Representations

Rudolf Rosa , David Mare\v{c}ek This is my paper

Pith reviewed 2026-05-25 15:07 UTC · model grok-4.3

classification 💻 cs.CL

keywords BERTdependency parsingunsupervised syntaxreducibilitycontextual embeddingsword deletionsyntactic trees

0 comments

The pith

BERT representations encode word reducibility that can be used to induce full dependency trees.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether BERT's contextual embeddings change more when a syntactically central word is deleted than when an optional word is removed. Reducibility is quantified for single words and for continuous phrases by averaging the shift in other tokens' vectors after deletion. These scores are shown to align with syntactic roles, and the authors assemble them into complete dependency trees without any labeled training data. A sympathetic reader would care because the method offers an unsupervised route from a pretrained language model to explicit syntax.

Core claim

Reducibility of a word or n-gram is defined as the average Euclidean distance between the BERT embeddings of the remaining tokens computed on the original sentence versus the sentence with that word or n-gram removed. Higher reducibility indicates that the item is syntactically peripheral. The authors compute these values across a corpus, observe that they correlate with part-of-speech and dependency labels, and then use the scores as edge weights to induce full dependency trees via a minimum spanning tree algorithm.

What carries the argument

Reducibility score computed from the change in BERT token embeddings upon deletion of a word or phrase.

If this is right

Reducibility is higher for adjectives and adverbs than for main verbs and subjects.
Continuous phrases can be ranked by reducibility to identify optional modifiers.
A single pass over a corpus yields both per-word reducibility values and complete parse trees.
The same deletion-based procedure works for n-grams longer than one word.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach may generalize to other transformer models whose representations are similarly sensitive to local grammatical violations.
Reducibility could serve as an auxiliary signal for improving unsupervised parsers in low-resource settings.
If the correlation with syntax holds across languages, the method supplies a language-agnostic way to bootstrap treebanks from raw text.

Load-bearing premise

The size of the change in BERT representations when a word is deleted tracks its syntactic importance rather than unrelated factors such as frequency or semantic salience.

What would settle it

Induced trees achieve low unlabeled attachment score against gold dependency annotations on a held-out corpus such as the Penn Treebank.

Figures

Figures reproduced from arXiv: 1906.11511 by David Mare\v{c}ek, Rudolf Rosa.

read the original abstract

We use the English model of BERT and explore how a deletion of one word in a sentence changes representations of other words. Our hypothesis is that removing a reducible word (e.g. an adjective) does not affect the representation of other words so much as removing e.g. the main verb, which makes the sentence ungrammatical and of "high surprise" for the language model. We estimate reducibilities of individual words and also of longer continuous phrases (word n-grams), study their syntax-related properties, and then also use them to induce full dependency trees.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper claims that deleting individual words or continuous n-grams from English sentences and measuring the resulting changes in BERT token representations can be used to estimate 'reducibility' scores for those tokens/phrases; these scores are hypothesized to reflect syntactic optionality (reducible words like adjectives perturb other representations less than non-reducible words like main verbs), their syntax-related properties are analyzed, and the scores are then used to induce full dependency trees.

Significance. If the central hypothesis holds after proper validation, the work would offer a novel, fully unsupervised route to syntactic structure extraction from a pre-trained LM, potentially illuminating what BERT encodes about syntax and providing a new signal for dependency parsing without treebank supervision.

major comments (3)

[Abstract] Abstract: the manuscript states the hypothesis and intended pipeline but supplies no quantitative results, validation metrics (e.g., UAS/LAS against gold trees), or derivation details on how reducibility scores are turned into trees; therefore the data-to-claim link cannot be assessed.
[Method] No section describes control experiments that hold sentence-level perplexity or grammaticality fixed while varying the syntactic position of the deleted token; without such controls the reducibility signal may simply track overall surprisal rather than dependency structure.
[Experiments] No table or section reports any comparison of the induced trees to baselines or gold-standard parses, leaving the claim that full dependency trees can be induced from these scores untested.

minor comments (1)

[Method] Notation for 'reducibility' and the precise aggregation over n-grams is introduced without an equation; adding a formal definition would improve clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments. We address each major point below and indicate the revisions we will make to strengthen the manuscript.

read point-by-point responses

Referee: [Abstract] Abstract: the manuscript states the hypothesis and intended pipeline but supplies no quantitative results, validation metrics (e.g., UAS/LAS against gold trees), or derivation details on how reducibility scores are turned into trees; therefore the data-to-claim link cannot be assessed.

Authors: We agree that the abstract would be improved by including key quantitative outcomes. The body of the manuscript details the tree induction procedure (reducibility scores are used to weight potential dependency edges, followed by a maximum spanning tree algorithm) and reports correlations between reducibility and syntactic categories. In revision we will expand the abstract to summarize the main empirical findings, including the UAS/LAS numbers obtained on gold-standard trees. revision: yes
Referee: [Method] No section describes control experiments that hold sentence-level perplexity or grammaticality fixed while varying the syntactic position of the deleted token; without such controls the reducibility signal may simply track overall surprisal rather than dependency structure.

Authors: This is a fair criticism; the current manuscript does not contain explicit controls that isolate syntactic position while holding perplexity constant. We will add a dedicated subsection that performs such controls (e.g., comparing deletions of words with matched surprisal but different syntactic roles) to demonstrate that the observed signal is not reducible to global surprisal alone. revision: yes
Referee: [Experiments] No table or section reports any comparison of the induced trees to baselines or gold-standard parses, leaving the claim that full dependency trees can be induced from these scores untested.

Authors: We acknowledge that while the manuscript describes how reducibility scores are converted into trees, it does not present a quantitative evaluation against gold parses or baselines. In the revised version we will add a results table reporting UAS and LAS on the Penn Treebank together with comparisons to unsupervised baselines (e.g., random spanning trees and PMI-based methods). revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained from BERT outputs.

full rationale

The paper computes word/phrase reducibility directly from measured changes in BERT token representations after deletion, then uses those scores to induce trees. No equations fit parameters to target dependency trees, no self-citation chain justifies the core hypothesis, and no ansatz or uniqueness theorem is smuggled in. The skeptic concern (surprisal vs. syntax) addresses external validity, not internal reduction of the claimed derivation to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on one domain assumption stated in the abstract; no free parameters or invented entities are mentioned.

axioms (1)

domain assumption The magnitude of change in BERT representations after word deletion correlates with the syntactic reducibility of the deleted word.
This is the explicit hypothesis used to justify both the reducibility estimation and the subsequent tree induction.

pith-pipeline@v0.9.0 · 5610 in / 1063 out tokens · 33169 ms · 2026-05-25T15:07:33.633996+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

10 extracted references · 10 canonical work pages · 1 internal anchor

[1]

URL: " 'urlintro :=

ENTRY address author booktitle chapter edition editor howpublished institution journal key month note number organization pages publisher school series title type volume year eprint doi pubmed url lastchecked label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block STRINGS urlintro eprinturl eprintpr...

work page
[2]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...

work page
[3]

Yonatan Belinkov. 2018. On Internal Language Representations in Deep Learning:An Analysis of Machine Translation and Speech Recognition . Ph.D. thesis, Massachusetts Institute of Technology

work page 2018
[4]

Jacob Devlin, Ming - Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. http://arxiv.org/abs/1810.04805 BERT: pre-training of deep bidirectional transformers for language understanding . CoRR, abs/1810.04805

work page internal anchor Pith review Pith/arXiv arXiv 2018
[5]

John Hewitt and Christopher D. Manning. 2019. Structural Probe for Finding Syntax in Word Representations . In Proceedings of NAACL 2019

work page 2019
[6]

McDonald, and Joakim Nivre

Sandra K \"u bler, Ryan T. McDonald, and Joakim Nivre. 2009. Dependency Parsing. Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers

work page 2009
[7]

Mark \' e ta Lopatkov \' a , Martin Pl \' a tek, and Vladislav Kubo n . 2005. Modeling syntax of free word-order languages: Dependency analysis by reduction. In Lecture Notes in Artificial Intelligence, Proceedings of the 8th International Conference, TSD 2005 , volume 3658 of Lecture Notes in Computer Science, pages 140--147, Berlin / Heidelberg. Springer

work page 2005
[8]

David Mare c ek and Zden e k Z abokrtsk\' y . 2012. Exploiting reducibility in unsupervised dependency parsing . In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL '12, pages 297--307, Stroudsburg, PA, USA. Association for Computational Linguistics

work page 2012
[9]

o rstell, Cristina Bosco, Gosse Bouma, Sam Bowman, Adriane Boyd, Aljoscha Burchardt, Marie Candito, Bernard Caron, Gauthier Caron, G \

Joakim Nivre, Mitchell Abrams, Z eljko Agi \'c , Lars Ahrenberg, Lene Antonsen, Katya Aplonova, Maria Jesus Aranzabe, Gashaw Arutie, Masayuki Asahara, Luma Ateyah, Mohammed Attia, Aitziber Atutxa, Liesbeth Augustinus, Elena Badmaeva, Miguel Ballesteros, Esha Banerjee, Sebastian Bank, Verginica Barbu Mititelu, Victoria Basmov, John Bauer, Sandra Bellato, K...

work page 2018
[10]

Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proc. of NAACL

work page 2018

[1] [1]

URL: " 'urlintro :=

ENTRY address author booktitle chapter edition editor howpublished institution journal key month note number organization pages publisher school series title type volume year eprint doi pubmed url lastchecked label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block STRINGS urlintro eprinturl eprintpr...

work page

[2] [2]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...

work page

[3] [3]

Yonatan Belinkov. 2018. On Internal Language Representations in Deep Learning:An Analysis of Machine Translation and Speech Recognition . Ph.D. thesis, Massachusetts Institute of Technology

work page 2018

[4] [4]

Jacob Devlin, Ming - Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. http://arxiv.org/abs/1810.04805 BERT: pre-training of deep bidirectional transformers for language understanding . CoRR, abs/1810.04805

work page internal anchor Pith review Pith/arXiv arXiv 2018

[5] [5]

John Hewitt and Christopher D. Manning. 2019. Structural Probe for Finding Syntax in Word Representations . In Proceedings of NAACL 2019

work page 2019

[6] [6]

McDonald, and Joakim Nivre

Sandra K \"u bler, Ryan T. McDonald, and Joakim Nivre. 2009. Dependency Parsing. Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers

work page 2009

[7] [7]

Mark \' e ta Lopatkov \' a , Martin Pl \' a tek, and Vladislav Kubo n . 2005. Modeling syntax of free word-order languages: Dependency analysis by reduction. In Lecture Notes in Artificial Intelligence, Proceedings of the 8th International Conference, TSD 2005 , volume 3658 of Lecture Notes in Computer Science, pages 140--147, Berlin / Heidelberg. Springer

work page 2005

[8] [8]

David Mare c ek and Zden e k Z abokrtsk\' y . 2012. Exploiting reducibility in unsupervised dependency parsing . In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL '12, pages 297--307, Stroudsburg, PA, USA. Association for Computational Linguistics

work page 2012

[9] [9]

o rstell, Cristina Bosco, Gosse Bouma, Sam Bowman, Adriane Boyd, Aljoscha Burchardt, Marie Candito, Bernard Caron, Gauthier Caron, G \

Joakim Nivre, Mitchell Abrams, Z eljko Agi \'c , Lars Ahrenberg, Lene Antonsen, Katya Aplonova, Maria Jesus Aranzabe, Gashaw Arutie, Masayuki Asahara, Luma Ateyah, Mohammed Attia, Aitziber Atutxa, Liesbeth Augustinus, Elena Badmaeva, Miguel Ballesteros, Esha Banerjee, Sebastian Bank, Verginica Barbu Mititelu, Victoria Basmov, John Bauer, Sandra Bellato, K...

work page 2018

[10] [10]

Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proc. of NAACL

work page 2018