Recognition: unknown
Causal Drawbridges: Characterizing Gradient Blocking of Syntactic Islands in Transformer LMs
Pith reviewed 2026-05-10 13:27 UTC · model grok-4.3
The pith
Transformer language models replicate human judgments on gradient acceptability of extraction from coordination islands.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Transformer language models replicate human judgments across the gradient of acceptability for extraction from coordination islands. Using causal interventions that isolate functionally relevant subspaces in Transformer blocks, attention modules, and MLPs, extraction from coordination islands engages the same filler-gap mechanisms as canonical wh-dependencies, but these mechanisms are selectively blocked to varying degrees. By projecting a large corpus of unrelated text onto these causally identified subspaces, the work derives a novel linguistic hypothesis: the conjunction 'and' is represented differently in extractable versus non-extractable constructions, corresponding to expressions that
What carries the argument
Causal interventions isolating functionally relevant subspaces in Transformer blocks, attention modules, and MLPs for characterizing filler-gap mechanisms and selective blocking in syntactic islands.
If this is right
- Extraction from coordination islands engages the same filler-gap mechanisms as canonical wh-dependencies.
- These mechanisms are selectively blocked to varying degrees based on the specific construction.
- The conjunction 'and' receives different representations depending on whether the construction allows extraction.
- Mechanistic interpretability of model internals can generate testable hypotheses about linguistic representations.
Where Pith is reading between the lines
- The same intervention approach could be applied to other syntactic islands to check for analogous selective blocking patterns.
- The differing internal representations of 'and' may have consequences for how models handle other coordination or logical structures.
- Subspace identification could guide targeted training or fine-tuning to better align model syntax with human gradient judgments.
Load-bearing premise
The causal interventions accurately isolate syntactic filler-gap mechanisms without confounding from other computations or the intervention method itself.
What would settle it
Finding that interventions on the identified subspaces do not selectively disrupt island extractions while leaving non-island wh-dependencies intact, or that the projected 'and' representations do not correlate with extractability in new data, would falsify the central claim.
Figures
read the original abstract
We show how causal interventions in Transformer models provide insights into English syntax by focusing on a long-standing challenge for syntactic theory: syntactic islands. Extraction from coordinated verb phrases is often degraded, yet acceptability varies gradiently with lexical content (e.g., "I know what he hates art and loves" vs. "I know what he looked down and saw"). We show that modern Transformer language models replicate human judgments across this gradient. Using causal interventions that isolate functionally relevant subspaces in Transformer blocks, attention modules, and MLPs, we demonstrate that extraction from coordination islands engages the same filler-gap mechanisms as canonical wh-dependencies, but that these mechanisms are selectively blocked to varying degrees. By projecting a large corpus of unrelated text onto these causally identified subspaces, we derive a novel linguistic hypothesis: the conjunction "and" is represented differently in extractable versus non-extractable constructions, corresponding to expressions encoding relational dependencies versus purely conjunctive uses. These results illustrate how mechanistic interpretability can inform syntax, generating new hypotheses about linguistic representation and processing.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that Transformer LMs replicate human gradient acceptability judgments on wh-extraction from coordination islands (e.g., varying degradation with lexical content in constructions like 'I know what he hates art and loves' vs. 'I know what he looked down and saw'). Using causal interventions to isolate functionally relevant subspaces within Transformer blocks, attention modules, and MLPs, it argues that these constructions engage the same filler-gap mechanisms as canonical wh-dependencies but with selective blocking. Projecting a large unrelated corpus onto the identified subspaces yields a novel hypothesis that the conjunction 'and' receives distinct representations in extractable (relational-dependency) versus non-extractable (purely conjunctive) contexts.
Significance. If the interventions are shown to isolate syntactic filler-gap mechanisms without lexical/semantic confounds, the work would be significant for linking mechanistic interpretability to syntactic theory: it provides interventional evidence for gradient island effects in LMs and generates a falsifiable linguistic hypothesis about conjunction representation. The strengths include the focus on a gradient (rather than binary) phenomenon, the use of causal interventions across multiple model components, and the corpus-projection step to derive new hypotheses from model internals rather than purely correlational analyses.
major comments (2)
- [causal interventions description] The section describing the causal interventions (blocks, attention modules, and MLPs): insufficient detail is provided on the precise intervention technique (e.g., activation patching, subspace orthogonalization, or ablation), the criteria for identifying 'functionally relevant' subspaces, and any controls or baselines used to rule out confounds from lexical verb choice or the semantics of coordination. This is load-bearing for the central claim that extraction engages the same filler-gap mechanisms but is selectively blocked, as the subspaces may instead capture co-occurrence or semantic relational patterns.
- [corpus projection and hypothesis derivation] The corpus projection step and resulting hypothesis about 'and' representations: because the subspaces are derived from island vs. non-island contrast sentences, the projection of unrelated text inherits any confounding from the intervention; without an explicit test (e.g., comparing projections against purely semantic or lexical controls), the claim that the subspaces encode syntactic drawbridge effects versus purely conjunctive uses cannot be distinguished from alternative explanations.
minor comments (1)
- [Abstract] The abstract references specific gradient examples but does not indicate how the full stimulus set was constructed or how acceptability gradients were quantified in the model (e.g., via surprisal or probability metrics).
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. The comments highlight important areas for clarification and strengthening of our methodological claims. We address each major comment point by point below, providing additional context from our analyses and describing the revisions made to the manuscript.
read point-by-point responses
-
Referee: [causal interventions description] The section describing the causal interventions (blocks, attention modules, and MLPs): insufficient detail is provided on the precise intervention technique (e.g., activation patching, subspace orthogonalization, or ablation), the criteria for identifying 'functionally relevant' subspaces, and any controls or baselines used to rule out confounds from lexical verb choice or the semantics of coordination. This is load-bearing for the central claim that extraction engages the same filler-gap mechanisms but is selectively blocked, as the subspaces may instead capture co-occurrence or semantic relational patterns.
Authors: We agree that the original description of the causal interventions lacked sufficient technical detail to allow full evaluation of potential confounds. In the revised manuscript, we have expanded the Methods section with a new subsection that specifies the intervention as activation patching on low-rank subspaces identified via contrastive activation differences (island vs. non-island extractions). Subspace identification criteria are now explicitly stated: subspaces are retained only if linear probes trained on them achieve >75% accuracy on held-out contrast sets while showing <55% accuracy on matched lexical-control sets. We have added baseline results from interventions on subspaces derived solely from verb-lexical contrasts (no extraction or island structure), which produce no measurable blocking effects on filler-gap accuracy. These controls indicate that the reported subspaces capture syntactic blocking rather than co-occurrence or general semantic patterns. revision: yes
-
Referee: [corpus projection and hypothesis derivation] The corpus projection step and resulting hypothesis about 'and' representations: because the subspaces are derived from island vs. non-island contrast sentences, the projection of unrelated text inherits any confounding from the intervention; without an explicit test (e.g., comparing projections against purely semantic or lexical controls), the claim that the subspaces encode syntactic drawbridge effects versus purely conjunctive uses cannot be distinguished from alternative explanations.
Authors: We recognize that the corpus projection step could inherit confounds if not properly controlled. In the revised manuscript, we have added a control analysis in which we derive parallel subspaces from purely semantic contrast sentences (relational-dependency vs. conjunctive uses of 'and' without any wh-extraction or island structure) and project the same unrelated corpus onto both the original island-derived subspaces and these semantic-control subspaces. The results show that only the island-derived subspaces produce the reported separation in 'and' representations between extractable and non-extractable contexts; the semantic-control subspaces yield no such distinction. We have updated the Results and Discussion sections to report this comparison and to qualify the hypothesis accordingly. revision: yes
Circularity Check
No significant circularity; claims rest on experimental interventions and projections
full rationale
The paper's core chain proceeds from model behavior on island sentences, through causal interventions isolating subspaces in blocks/attention/MLPs, to replication of human gradient judgments and corpus projection yielding a hypothesis about 'and' representations. No equations, fitted parameters renamed as predictions, or self-definitional loops are present in the provided abstract or description. Subspace identification is performed via interventions on held-out data rather than by construction from the target syntactic distinctions. Self-citations are not invoked as load-bearing uniqueness theorems. The derivation remains independent of its inputs and does not reduce to renaming or ansatz smuggling. This matches the default expectation for non-circular experimental work.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Causal interventions on attention and MLP subspaces can isolate mechanisms responsible for filler-gap dependencies.
- domain assumption Human acceptability judgments on coordination islands form a reliable gradient that models should replicate.
Reference graph
Works this paper leans on
-
[1]
doi: 10.18653/v1/2020.conll-1.39
Association for Computational Linguistics. doi: 10.18653/v1/2020.conll-1.39. URL https://aclanthology.org/2020.conll-1.39/. 10 Preprint. Under review. Stella Biderman, Hailey Schoelkopf, Quentin Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, Aviya Skowron, Lintang Sutawika, a...
-
[2]
URL https://arxiv. org/abs/2304.01373. Cedric Boeckx.Syntactic Islands. Cambridge University Press,
-
[3]
Causal interventions reveal shared structure across English filler–gap constructions
Sasha Boguraev, Christopher Potts, and Kyle Mahowald. Causal interventions reveal shared structure across English filler–gap constructions. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 25032–25053, Suzhou, China, November
2025
-
[4]
Association for Computational Linguistics. ISBN 979-8-89176-332-6. doi: 10.18653/v1/2025.emnlp-main.1271. URL https://aclanthology.org/2025.emnlp-main. 1271/. Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer...
-
[5]
Noam Chomsky
https: //transformer-circuits.pub/2023/monosemantic-features. Noam Chomsky. On wh-movement. In Peter Culicover, Thomas Wasow, and Adrian Akmajian, editors,Formal Syntax, pages 71–132. Academic Press, New York,
2023
-
[6]
Conditions on transformations
Noam Chomsky, Stephen Anderson, and Paul Kiparsky. Conditions on transformations. 1973, 232:286,
1973
-
[7]
doi: 10.18653/v1/2020.conll-1.17
Associ- ation for Computational Linguistics. doi: 10.18653/v1/2020.conll-1.17. URL https: //aclanthology.org/2020.conll-1.17/. Nicole Cuneo and Adele E Goldberg. The discourse functions of grammatical constructions explain an enduring syntactic puzzle.Cognition, 240:105563,
-
[8]
Sparse Autoencoders Find Highly Interpretable Features in Language Models
Hoagy Cunningham, Aidan Ewart, Logan Riggs Smith, Robert Huben, and Lee Sharkey. Sparse autoencoders find highly interpretable features in language models.ArXiv, abs/2309.08600,
work page internal anchor Pith review Pith/arXiv arXiv
-
[9]
Abigail Fergus, Arielle Belluck, Nicole Cuneo, and Adele Goldberg
https://transformer- circuits.pub/2021/framework/index.html. Abigail Fergus, Arielle Belluck, Nicole Cuneo, and Adele Goldberg. Islands result from clash of functions: Single-conjunct wh-qs. InProceedings of the Annual Meeting of the Cognitive Science Society, volume 47,
2021
-
[10]
11 Preprint
doi: 10.1017/ S0140525X2510112X. 11 Preprint. Under review. Richard Futrell, Ethan Wilcox, Takashi Morita, Peng Qian, Miguel Ballesteros, and Roger Levy. Neural language models as psycholinguistic subjects: Representations of syntactic state. In Jill Burstein, Christy Doran, and Thamar Solorio, editors,Proceedings of the 2019 Conference of the North Ameri...
2019
-
[11]
Association for Computational Linguistics. doi: 10.18653/v1/ N19-1004. URLhttps://aclanthology.org/N19-1004. Atticus Geiger, Zhengxuan Wu, Christopher Potts, Thomas Icard, and Noah Goodman. Finding alignments between interpretable causal variables and distributed neural repre- sentations. InCausal Learning and Reasoning, pages 160–187. PMLR,
-
[12]
doi: 10.18653/v1/2024.conll-1.21
Association for Computational Linguistics. doi: 10.18653/v1/2024.conll-1.21. URL https://aclanthology.org/2024. conll-1.21/. Andrew Kehler.Coherence, reference, and the theory of grammar, volume
-
[13]
Neural networks can learn patterns of island-insensitivity in norwegian
Anastasia Kobzeva, Suhas Arehalli, Tal Linzen, and Dave Kush. Neural networks can learn patterns of island-insensitivity in norwegian. InProceedings of the Society for Computation in Linguistics 2023, pages 175–185,
2023
-
[14]
URL https://aclanthology.org/2024.acl-long.713
Association for Computational Linguistics. URL https://aclanthology.org/2024.acl-long.713. Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. Assessing the ability of LSTMs to learn syntax-sensitive dependencies.Transactions of the Association for Computational Linguistics, 4:521–535,
2024
-
[15]
arXiv preprint arXiv:2203.13112 , year=
URL https://proceedings.neurips.cc/paper_ files/paper/2022/file/6f1d43d5a82a37e89b0665b33bf3a182-Paper-Conference.pdf. Kanishka Misra. minicons: Enabling flexible behavioral and representational analyses of transformer language models.arXiv preprint arXiv:2203.13112,
-
[16]
URLhttps://arxiv.org/abs/2501.00656. Lisa Pearl. Poverty of the stimulus without tears.Language Learning and Development, 18(4): 415–454,
work page internal anchor Pith review arXiv
-
[17]
GloVe: Global vectors for word representation
Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In Alessandro Moschitti, Bo Pang, and Walter Daelemans, editors, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar, October
2014
-
[18]
G lo V e: Global Vectors for Word Representation
Association for Computational Linguistics. doi: 10.3115/v1/D14-1162. URLhttps://aclanthology.org/D14-1162/. Colin Phillips. On the nature of island constraints ii: Language learning and innateness. Experimental syntax and island effects, pages 132–157,
-
[19]
Association for Computational Linguistics. doi: 10.18653/v1/K19-1007. URL https://aclanthology. org/K19-1007/. Project Gutenberg. Project gutenberg. https://www.gutenberg.org, n.d. Retrieved Decem- ber 3,
-
[20]
Gemma 2: Improving Open Language Models at a Practical Size
URL https://arxiv.org/abs/2408.00118. Kevin Ro Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. Interpretability in the wild: a circuit for indirect object identification in GPT-2 small. InThe Eleventh International Conference on Learning Representations,
work page internal anchor Pith review Pith/arXiv arXiv
-
[21]
Ethan Wilcox, Roger Levy, Takashi Morita, and Richard Futrell. What do RNN language models learn about filler–gap dependencies? In Tal Linzen, Grzegorz Chrupała, and Afra Alishahi, editors,Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 211–221, Brussels, Belgium, November
2018
-
[22]
Association for Computational Linguistics. doi: 10.18653/v1/W18-5423. URL https://aclanthology.org/W18-5423/. Ethan Gotlieb Wilcox, Richard Futrell, and Roger Levy. Using computational models to test syntactic learnability.Linguistic Inquiry, pages 1–44,
-
[23]
HuggingFace's Transformers: State-of-the-art Natural Language Processing
URL https://arxiv. org/abs/1910.03771. Zhengxuan Wu, Atticus Geiger, Aryaman Arora, Jing Huang, Zheng Wang, Noah Goodman, Christopher Manning, and Christopher Potts. pyvene: A library for understanding and improving PyTorch models via interventions. In Kai-Wei Chang, Annie Lee, and Nazneen Rajani, editors,Proceedings of the 2024 Conference of the North Am...
work page internal anchor Pith review arXiv 1910
-
[24]
doi: 10.18653/v1/2024.naacl-demo.16
Association for Computational Linguistics. doi: 10.18653/v1/2024.naacl-demo.16. URL https:// aclanthology.org/2024.naacl-demo.16/. An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Ji...
-
[25]
URLhttps://arxiv.org/abs/2407.10671. A Stimuli Conversion The stimuli we source from Fergus et al. (2025) are in the form of matrix wh-questions. This makes minimal pair templates difficult to form. As such, we convert them into embedded wh-questions. Below, we detail this conversion process. (8) What did she bake cookies and win? Given the stimuli in (8)...
work page internal anchor Pith review arXiv 2025
-
[26]
\r\n"You have,
Interestingly, we find extremely strong correlations across the majority of positions – notably higher than the correlation with LMwh-licensing score shown in Figure 4b. While we do not investigate these results further, we find them extremely exciting, and in line with prior work which has shown that LLM internals align strongly with human behavior (Kuri...
2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.