arxiv: 2605.15562 · v1 · pith:N6JSGJ2Nnew · submitted 2026-05-15 · 💻 cs.CL

GiLT: Augmenting Transformer Language Models with Dependency Graphs

Pith reviewed 2026-05-20 19:36 UTC · model grok-4.3

classification 💻 cs.CL
keywords language modeling · dependency graphs · transformer · syntactic generalization · attention modulation · graph infusion · semantic dependencies · fine-tuning

The pith

Modulating Transformer attention with features from incrementally built dependency graphs improves syntactic generalization without extra structural tokens, while keeping perplexity competitive with baselines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes GiLT to augment standard Transformer language models by injecting information from dependency graphs. Instead of adding structural tokens, it builds the graph step by step as tokens are predicted and uses extracted features to adjust attention weights inside the model. Experiments show this yields stronger performance on syntactic generalization tasks while keeping perplexity close to baseline levels. The same approach works when starting from an already trained language model and then fine-tuning for specific tasks.

Core claim

GiLT injects structural information into language modeling by modulating attention weights in the Transformer with features extracted from the dependency graph that is incrementally constructed along with token prediction. GiLT with semantic dependency graphs achieves better syntactic generalization while maintaining competitive perplexity in comparison with Transformer language model baselines and can be finetuned from a pretrained language model to achieve improved downstream task performance.

What carries the argument

Graph-infused layers that extract features from the incrementally constructed dependency graph and use them to modulate attention weights during token prediction.
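
To make the mechanism above concrete, here is a minimal sketch of the two steps it names: growing a graph as tokens arrive, then using a graph-derived feature to shift attention. It is an illustration under stated assumptions, not the paper's implementation. The class `IncrementalGraph`, the function `modulated_attention`, the choice of graph distance as the only feature, the additive bias on raw scores, and both bias constants are hypothetical; the source says only that features extracted from the incrementally built graph modulate attention weights.

```python
import math
from collections import deque


class IncrementalGraph:
    """Undirected view of a dependency graph that grows token by token."""

    def __init__(self):
        self.neighbors = []  # neighbors[i] = set of tokens sharing an arc with i

    def add_token(self, arcs_to_previous):
        """Append a token and link it to the given earlier token indices."""
        new = len(self.neighbors)
        self.neighbors.append(set())
        for j in arcs_to_previous:
            if not 0 <= j < new:
                raise ValueError(f"arc target {j} is not an earlier token")
            self.neighbors[new].add(j)
            self.neighbors[j].add(new)
        return new

    def distances_from(self, source):
        """Breadth-first distance from `source`; None marks unreachable tokens."""
        dist = [None] * len(self.neighbors)
        dist[source] = 0
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in self.neighbors[u]:
                if dist[v] is None:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        return dist


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def modulated_attention(scores, distances, bias_per_hop=-1.0, unreachable_bias=-4.0):
    """Add a graph-distance bias to raw attention scores, then normalize.

    Both bias values are illustrative constants chosen for this sketch.
    """
    biased = [
        s + (unreachable_bias if d is None else bias_per_hop * d)
        for s, d in zip(scores, distances)
    ]
    return softmax(biased)


# Toy prefix "the cat sleeps" with assumed arcs the-cat and cat-sleeps.
graph = IncrementalGraph()
graph.add_token([])            # 0: the
graph.add_token([0])           # 1: cat
query = graph.add_token([1])   # 2: sleeps
distances = graph.distances_from(query)  # graph distance to each prefix token
weights = modulated_attention([0.0, 0.0, 0.0], distances)
```

With uniform raw scores, the resulting weights favor tokens that are closer to the query in the graph. Because `add_token` accepts a list of arcs, a token can attach to several earlier tokens, which a graph (as opposed to a tree) permits.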

If this is right

  • GiLT produces measurable gains in syntactic generalization over plain Transformer baselines.
  • Perplexity remains competitive with unmodified Transformer language models.
  • The method transfers to fine-tuning a pretrained language model for better results on downstream tasks.
  • Structural information enters the model without inserting additional special tokens into the sequence.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The incremental graph construction might be adapted to other graph types such as semantic role graphs or discourse graphs.
  • Models trained this way could require fewer examples to reach a given level of syntactic accuracy on parsing-related tasks.
  • Combining the attention-modulation step with other structural signals such as constituency information could produce further gains.

Load-bearing premise

Features taken from a dependency graph built token by token can be used to adjust attention in a way that improves syntactic generalization while keeping perplexity competitive and without needing extra structural tokens.

What would settle it

Running the same syntactic generalization benchmarks on a version of GiLT with the graph-modulation component removed and finding no drop in performance would show the dependency-graph features are not what drives the reported gains.

Figures

Figures reproduced from arXiv: 2605.15562 by Chuyan Zhou, Kewei Tu, Tianyu Huang, Yida Zhao.

Figure 1: Illustration of how the feature tape is recom…
Figure 2: Scores on the 6 circuits of the SG test suites…
Figure 3: Left: visualization of attention scores of the first head in the last layer of GiLT (left) and TXL (right) given…
Original abstract

Augmenting Transformers with linguistic structures effectively enhances the syntactic generalization performance of language models. Previous work in this direction focuses on syntactic tree structures of languages, in particular constituency tree structures. We propose Graph-Infused Layers Transformer Language Model (GiLT) which leverages dependency graphs for augmenting Transformer language models. Unlike most previous work, GiLT does not insert extra structural tokens in language modeling; instead, it injects structural information into language modeling by modulating attention weights in the Transformer with features extracted from the dependency graph that is incrementally constructed along with token prediction. In our experiments, GiLT with semantic dependency graphs achieves better syntactic generalization while maintaining competitive perplexity in comparison with Transformer language model baselines. In addition, GiLT can be finetuned from a pretrained language model to achieve improved downstream task performance. Our code is released at https://github.com/cookie-pie-oops/GiLT-LM.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Graph-Infused Layers Transformer (GiLT), which augments Transformer language models by modulating attention weights using features extracted from dependency graphs that are incrementally constructed during token prediction. Unlike prior work that inserts extra structural tokens, GiLT avoids this overhead. Experiments claim that GiLT with semantic dependency graphs yields better syntactic generalization while preserving competitive perplexity relative to standard Transformer baselines, and that the model can be finetuned from a pretrained LM for improved downstream task performance.

Significance. If the empirical gains hold after proper controls, the approach offers an efficient route to infuse linguistic structure into LMs without token overhead or major perplexity degradation. The public code release supports reproducibility and is a clear strength.

major comments (2)
  1. [§4 (Experiments) and associated ablations] The central syntactic-generalization claim depends on reliable attention modulation from incrementally built dependency graphs on model-generated prefixes. No oracle-vs-predicted parse comparison or error-injection ablation is reported; without these, it is impossible to rule out that gains arise from incidental regularization in the modulation mechanism rather than the intended structural signal. This directly bears on whether the reported improvements support the main claim.
  2. [§3 (Method), attention-modulation equations] The precise mapping from graph features to attention-weight adjustments is underspecified with respect to additional parameters or learned components. If the modulation introduces extra degrees of freedom, the claim of a lightweight structural augmentation requires explicit quantification of parameter count relative to the baseline Transformer.
minor comments (2)
  1. The abstract and §1 should explicitly cite the dependency parser and graph-construction procedure (including any preprocessing of semantic dependencies) so readers can assess reproducibility.
  2. Table 1 (or equivalent results table) would benefit from reporting standard deviations over multiple runs and clearer baseline descriptions to allow direct comparison of perplexity and generalization metrics.
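
The parameter-count quantification requested in major comment 2 can be sketched as follows. This is a rough accounting under stated assumptions, not figures from the paper: `layer_params` counts one standard Transformer layer (four attention projections and a two-layer feed-forward block, all with biases, plus two layer norms), and `relative_overhead` is a hypothetical helper. The model sizes and the example of 16 extra scalars are placeholders chosen for illustration only.

```python
def layer_params(d_model, d_ff):
    """Parameter count of one standard Transformer layer (with biases)."""
    attention = 4 * (d_model * d_model + d_model)        # Q, K, V, output projections
    feed_forward = 2 * d_model * d_ff + d_ff + d_model   # two linear maps and biases
    layer_norms = 2 * (2 * d_model)                      # gain and bias, two norms
    return attention + feed_forward + layer_norms


def relative_overhead(extra_params, baseline_params):
    """Extra parameters as a fraction of the baseline count."""
    if baseline_params <= 0:
        raise ValueError("baseline_params must be positive")
    return extra_params / baseline_params


# Placeholder sizes, not taken from the paper.
baseline = layer_params(d_model=512, d_ff=2048)
overhead_if_parameter_free = relative_overhead(0, baseline)
overhead_with_16_scalars = relative_overhead(16, baseline)
```

Reporting the overhead as a fraction of the baseline makes the "lightweight" claim directly checkable, whatever the actual number of added parameters turns out to be.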

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments help clarify key aspects of our claims and method. We address each major comment below and describe the revisions we will make.

Point-by-point responses
  1. Referee: §4 (Experiments) and associated ablations: the central syntactic-generalization claim depends on reliable attention modulation from incrementally built dependency graphs on model-generated prefixes. No oracle-vs-predicted parse comparison or error-injection ablation is reported; without these, it is impossible to rule out that gains arise from incidental regularization in the modulation mechanism rather than the intended structural signal. This directly bears on whether the reported improvements support the main claim.

    Authors: We agree that additional controls would strengthen the evidence supporting our central claim. In the revised manuscript we will add an error-injection ablation: we will randomly corrupt a fraction of the edges in the incrementally predicted dependency graphs and show that the syntactic generalization gains are substantially reduced, indicating that the improvements arise from the structural signal rather than incidental regularization effects of the modulation. We will also include a short discussion noting that our experiments use predicted parses (to reflect realistic deployment) while acknowledging that oracle parses would likely produce an upper bound on performance. revision: yes

  2. Referee: §3 (Method), attention-modulation equations: the precise mapping from graph features to attention-weight adjustments is underspecified with respect to additional parameters or learned components. If the modulation introduces extra degrees of freedom, the claim of a lightweight structural augmentation requires explicit quantification of parameter count relative to the baseline Transformer.

    Authors: We appreciate this observation. The attention modulation in GiLT is a deterministic, parameter-free function that derives multiplicative adjustments directly from the extracted dependency-graph features (edge labels and distances) and applies them to the attention scores; no additional trainable weights or learned components are introduced. Consequently, GiLT has exactly the same parameter count as the baseline Transformer. In the revision we will expand Section 3 with the complete modulation equations and explicitly report this parameter equivalence to substantiate the lightweight claim. revision: yes
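
The error-injection ablation proposed in the first response (randomly corrupting a fraction of graph edges) could look roughly like the sketch below. It is one possible reading, not the authors' procedure: the function name `corrupt_edges`, the `(head, dependent)` pair representation, and the choice to rewire the head while keeping the dependent are all assumptions. It also ignores the incremental setting, where a rewired arc would have to stay within the already-generated prefix.

```python
import random


def corrupt_edges(edges, num_tokens, fraction, seed=0):
    """Return a copy of `edges` with a fraction of arcs rewired at random.

    edges: list of (head, dependent) token-index pairs.
    Each selected arc keeps its dependent and receives a different head.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    rng = random.Random(seed)  # fixed seed keeps the ablation reproducible
    num_to_corrupt = round(fraction * len(edges))
    chosen = set(rng.sample(range(len(edges)), num_to_corrupt))
    corrupted = []
    for idx, (head, dep) in enumerate(edges):
        if idx in chosen:
            # New head must differ from the old head and from the dependent.
            candidates = [h for h in range(num_tokens) if h not in (head, dep)]
            if candidates:
                head = rng.choice(candidates)
        corrupted.append((head, dep))
    return corrupted


# Toy five-token graph with assumed arcs.
original = [(1, 0), (2, 1), (2, 3), (3, 4)]
noisy = corrupt_edges(original, num_tokens=5, fraction=0.5)
num_changed = sum(1 for a, b in zip(original, noisy) if a != b)
```

Sweeping `fraction` from 0 to 1 and plotting syntactic-generalization scores against it would show whether gains degrade as the structural signal is destroyed, which is the pattern the response predicts.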

Circularity Check

0 steps flagged

Empirical augmentation technique with no load-bearing derivations or self-citation chains

Full rationale

The paper presents GiLT as an architectural augmentation that modulates Transformer attention using features from an incrementally built dependency graph during token prediction. No equations, derivations, or fitted parameters are described that reduce to the experimental inputs by construction. The central claims rest on empirical comparisons of perplexity and syntactic generalization against baselines, with no self-definitional loops, uniqueness theorems imported from prior author work, or ansatzes smuggled via citation. The method is self-contained as a proposed engineering technique whose validity is tested externally via held-out metrics rather than internal redefinition.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No explicit free parameters, axioms, or invented entities are described in the abstract; the method is presented as an empirical augmentation relying on standard dependency parsing and Transformer components.

pith-pipeline@v0.9.0 · 5687 in / 1009 out tokens · 37380 ms · 2026-05-20T19:36:17.011341+00:00 · methodology
