Inducing Syntactic Trees from BERT Representations
Pith reviewed 2026-05-25 15:07 UTC · model grok-4.3
The pith
BERT representations encode word reducibility that can be used to induce full dependency trees.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Reducibility of a word or n-gram is estimated from the average Euclidean distance between the BERT embeddings of the remaining tokens computed on the original sentence and on the sentence with that word or n-gram removed. A smaller distance indicates higher reducibility, i.e., that the item is syntactically peripheral. The authors compute these values across a corpus, observe that they correlate with part-of-speech and dependency labels, and then use the scores as edge weights to induce full dependency trees via a spanning-tree algorithm.
What carries the argument
Reducibility score computed from the change in BERT token embeddings upon deletion of a word or phrase.
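A minimal sketch of the deletion-based measurement, assuming one embedding vector per token and that the surviving tokens of the shortened sentence line up one-to-one with the original ones (real BERT output uses subword pieces and would need an alignment step). The function name `deletion_distance` and the toy vectors are hypothetical, not taken from the paper. Under the abstract's hypothesis, a smaller value would correspond to a more reducible item.

```python
import math


def deletion_distance(original, reduced, start, length):
    """Average Euclidean distance between the embeddings of the tokens
    that survive deleting original[start:start + length].

    original: list of vectors, one per token of the full sentence
    reduced:  list of vectors, one per token of the shortened sentence
    """
    # Tokens kept from the original sentence, in their original order.
    kept = original[:start] + original[start + length:]
    if not kept:
        raise ValueError("cannot delete the whole sentence")
    if len(kept) != len(reduced):
        raise ValueError("reduced sentence must align with the kept tokens")
    return sum(math.dist(a, b) for a, b in zip(kept, reduced)) / len(kept)


# Toy 2-dimensional "embeddings" (hand-picked values, not BERT output).
original = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
reduced = [[0.0, 0.0], [2.0, 0.0]]  # middle token deleted, neighbours unchanged
shifted = [[3.0, 4.0], [2.0, 0.0]]  # first surviving token moved by distance 5

print(deletion_distance(original, reduced, 1, 1))  # 0.0
print(deletion_distance(original, shifted, 1, 1))  # 2.5
```

In a real pipeline the two vector lists would come from running the encoder twice, once on the full sentence and once on the shortened one.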
If this is right
- Reducibility is higher for adjectives and adverbs than for main verbs and subjects.
- Continuous phrases can be ranked by reducibility to identify optional modifiers.
- A single pass over a corpus yields both per-word reducibility values and complete parse trees.
- The same deletion-based procedure works for n-grams longer than one word.
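The n-gram ranking mentioned above can be sketched as follows. This is an illustration only: `contiguous_spans`, `rank_spans`, and the toy distances are hypothetical, and the scoring callback stands in for whatever deletion-based measurement is actually used. Spans are ordered from smallest to largest deletion distance, so the most reducible candidates come first.

```python
def contiguous_spans(n_tokens, max_len):
    """Yield (start, length) for every contiguous n-gram up to max_len
    that leaves at least one token in the sentence."""
    for length in range(1, max_len + 1):
        if length >= n_tokens:
            break
        for start in range(n_tokens - length + 1):
            yield (start, length)


def rank_spans(tokens, distance_of, max_len=2):
    """Return (phrase, distance) pairs, smallest deletion distance first."""
    spans = sorted(contiguous_spans(len(tokens), max_len),
                   key=lambda span: distance_of(*span))
    return [(" ".join(tokens[s:s + n]), distance_of(s, n)) for s, n in spans]


tokens = ["the", "old", "dog", "barked"]
# Hand-picked distances for illustration; not measured from any model.
toy = {(0, 1): 0.4, (1, 1): 0.2, (2, 1): 0.9, (3, 1): 1.5,
       (0, 2): 0.5, (1, 2): 1.1, (2, 2): 1.8}
ranking = rank_spans(tokens, lambda s, n: toy[(s, n)])
print(ranking[0])   # ('old', 0.2)
print(ranking[-1])  # ('dog barked', 1.8)
```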
Where Pith is reading between the lines
- The approach may generalize to other transformer models whose representations are similarly sensitive to local grammatical violations.
- Reducibility could serve as an auxiliary signal for improving unsupervised parsers in low-resource settings.
- If the correlation with syntax holds across languages, the method supplies a language-agnostic way to bootstrap treebanks from raw text.
Load-bearing premise
The size of the change in BERT representations when a word is deleted tracks its syntactic importance rather than unrelated factors such as frequency or semantic salience.
What would settle it
The claim would be undermined if the induced trees achieved a low unlabeled attachment score (UAS) against gold dependency annotations on a held-out corpus such as the Penn Treebank.
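For reference, the unlabeled attachment score is the fraction of tokens whose predicted head matches the gold head. The sketch below assumes heads are given as 1-based token indices with 0 marking the root; the function name and the example head lists are hypothetical. If the induced trees are undirected, an undirected variant of the score may be the more appropriate comparison.

```python
def unlabeled_attachment_score(predicted_heads, gold_heads):
    """Fraction of tokens whose predicted head index equals the gold head.

    Heads are 1-based token indices; 0 marks the root.
    """
    if len(predicted_heads) != len(gold_heads):
        raise ValueError("head lists must have the same length")
    if not gold_heads:
        raise ValueError("empty sentence")
    correct = sum(p == g for p, g in zip(predicted_heads, gold_heads))
    return correct / len(gold_heads)


gold = [3, 3, 4, 0]  # made-up gold heads for a 4-token sentence
pred = [3, 3, 4, 3]  # one wrong attachment (last token)
print(unlabeled_attachment_score(pred, gold))  # 0.75
```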
Original abstract
We use the English model of BERT and explore how a deletion of one word in a sentence changes representations of other words. Our hypothesis is that removing a reducible word (e.g. an adjective) does not affect the representation of other words so much as removing e.g. the main verb, which makes the sentence ungrammatical and of "high surprise" for the language model. We estimate reducibilities of individual words and also of longer continuous phrases (word n-grams), study their syntax-related properties, and then also use them to induce full dependency trees.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that deleting individual words or continuous n-grams from English sentences and measuring the resulting changes in BERT token representations yields 'reducibility' scores for those words and phrases. The scores are hypothesized to reflect syntactic optionality: deleting a reducible word such as an adjective perturbs the other representations less than deleting a non-reducible word such as the main verb. The paper analyzes the syntax-related properties of the scores and then uses them to induce full dependency trees.
Significance. If the central hypothesis holds after proper validation, the work would offer a novel, fully unsupervised route to syntactic structure extraction from a pre-trained LM, potentially illuminating what BERT encodes about syntax and providing a new signal for dependency parsing without treebank supervision.
major comments (3)
- [Abstract] The manuscript states the hypothesis and intended pipeline but supplies no quantitative results, validation metrics (e.g., UAS/LAS against gold trees), or derivation details on how reducibility scores are turned into trees; therefore the data-to-claim link cannot be assessed.
- [Method] No section describes control experiments that hold sentence-level perplexity or grammaticality fixed while varying the syntactic position of the deleted token; without such controls the reducibility signal may simply track overall surprisal rather than dependency structure.
- [Experiments] No table or section reports any comparison of the induced trees to baselines or gold-standard parses, leaving the claim that full dependency trees can be induced from these scores untested.
minor comments (1)
- [Method] Notation for 'reducibility' and the precise aggregation over n-grams is introduced without an equation; adding a formal definition would improve clarity.
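One possible formalization, offered only as a sketch consistent with the verbal description in this review: all symbols are assumptions and none are taken from the manuscript. Let $s = w_1 \dots w_n$ be a sentence, $g = w_i \dots w_{i+k-1}$ a contiguous n-gram with $k < n$, $s \setminus g$ the sentence with $g$ removed, and $h_j(\cdot)$ the BERT vector of token $w_j$ in the given sentence.

```latex
d(g, s) \;=\; \frac{1}{n - k}
  \sum_{\substack{j = 1 \\ j \notin \{i, \dots, i + k - 1\}}}^{n}
  \bigl\lVert\, h_j(s) - h_j(s \setminus g) \,\bigr\rVert_2
```

Under the stated hypothesis, a small $d(g, s)$ would mark $g$ as reducible. Which layer supplies $h_j$, and whether the distance is normalized further, is not specified here.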
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive comments. We address each major point below and indicate the revisions we will make to strengthen the manuscript.
Point-by-point responses
- Referee: [Abstract] The manuscript states the hypothesis and intended pipeline but supplies no quantitative results, validation metrics (e.g., UAS/LAS against gold trees), or derivation details on how reducibility scores are turned into trees; therefore the data-to-claim link cannot be assessed.
  Authors: We agree that the abstract would be improved by including key quantitative outcomes. The body of the manuscript details the tree induction procedure (reducibility scores are used to weight potential dependency edges, followed by a maximum spanning tree algorithm) and reports correlations between reducibility and syntactic categories. In revision we will expand the abstract to summarize the main empirical findings, including the UAS/LAS numbers obtained on gold-standard trees.
  revision: yes
- Referee: [Method] No section describes control experiments that hold sentence-level perplexity or grammaticality fixed while varying the syntactic position of the deleted token; without such controls the reducibility signal may simply track overall surprisal rather than dependency structure.
  Authors: This is a fair criticism; the current manuscript does not contain explicit controls that isolate syntactic position while holding perplexity constant. We will add a dedicated subsection that performs such controls (e.g., comparing deletions of words with matched surprisal but different syntactic roles) to demonstrate that the observed signal is not reducible to global surprisal alone.
  revision: yes
- Referee: [Experiments] No table or section reports any comparison of the induced trees to baselines or gold-standard parses, leaving the claim that full dependency trees can be induced from these scores untested.
  Authors: We acknowledge that while the manuscript describes how reducibility scores are converted into trees, it does not present a quantitative evaluation against gold parses or baselines. In the revised version we will add a results table reporting UAS and LAS on the Penn Treebank together with comparisons to unsupervised baselines (e.g., random spanning trees and PMI-based methods).
  revision: yes
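The spanning-tree step described in the responses above can be sketched with Prim's algorithm on a dense symmetric weight matrix. This is an illustration under stated assumptions, not the authors' procedure: how reducibility scores are mapped to edge weights is not specified here, so the matrix `W` is hypothetical, and the result is an undirected tree (turning it into a dependency tree would still require choosing a root and orienting the edges).

```python
def maximum_spanning_tree(weights):
    """Prim's algorithm on a dense symmetric weight matrix.

    Returns undirected edges (i, j), where i was already in the tree
    when j was attached.
    """
    n = len(weights)
    if n == 0:
        return []
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        best = None
        # Pick the heaviest edge that leaves the current tree.
        for i in in_tree:
            for j in range(n):
                if j in in_tree:
                    continue
                if best is None or weights[i][j] > weights[best[0]][best[1]]:
                    best = (i, j)
        edges.append(best)
        in_tree.add(best[1])
    return edges


tokens = ["the", "old", "dog", "barked"]
# Hypothetical symmetric edge weights, chosen by hand for illustration.
W = [
    [0, 1, 5, 1],
    [1, 0, 4, 1],
    [5, 4, 0, 6],
    [1, 1, 6, 0],
]
tree = maximum_spanning_tree(W)
print([(tokens[i], tokens[j]) for i, j in tree])
# [('the', 'dog'), ('dog', 'barked'), ('dog', 'old')]
```

A directed alternative would be the Chu-Liu/Edmonds algorithm, which finds a maximum spanning arborescence when edge weights are asymmetric.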
Circularity Check
No significant circularity; derivation is self-contained from BERT outputs.
full rationale
The paper computes word/phrase reducibility directly from measured changes in BERT token representations after deletion, then uses those scores to induce trees. No equations fit parameters to target dependency trees, no self-citation chain justifies the core hypothesis, and no ansatz or uniqueness theorem is smuggled in. The skeptic concern (surprisal vs. syntax) addresses external validity, not internal reduction of the claimed derivation to its inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The magnitude of change in BERT representations after word deletion correlates with the syntactic reducibility of the deleted word.