Unraveling Syntax: Language Modeling and the Substructure of Grammars

Daniel Mitropolsky; Laura Ying Schulz; Tomaso Poggio

arxiv: 2510.02524 · v3 · pith:E3JT3FVO · submitted 2025-10-02 · cs.CL · cs.FL · cs.LG

keywords: language, subgrammars, modeling, models, substructure, cfgs, grammar, grammars

While language models achieve impressive results, their learning dynamics are far from understood. Many domains of interest -- such as natural language syntax, coding languages, arithmetic -- are captured by context-free grammars (CFGs). In this work, we extend prior work on neural language modeling of CFGs in a novel direction: how language modeling behaves with respect to CFG substructure, namely subgrammars. We define subgrammars, and prove a set of fundamental theorems connecting language modeling and subgrammars. We show that language modeling loss recurses linearly over its top-level subgrammars; applied recursively, the loss decomposes into losses for "irreducible" subgrammars. Under additional assumptions, and empirically, parametrized models learn subgrammars in parallel, unlike children who first master simple substructures. We find that subgrammar pretraining can improve final performance, but only for tiny models relative to the grammar, while alignment analyses show that pretraining consistently leads to internal representations that better reflect the grammar's substructure.
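
The linear recursion of the loss over top-level subgrammars can be illustrated on a toy case. The sketch below is not the paper's theorem or setup; it assumes a hypothetical grammar whose start symbol picks one of two subgrammars A or B with disjoint languages, and it measures loss as KL divergence between the true string distribution and a model's. Under those assumptions the total divergence equals the divergence of the top-level choice plus the probability-weighted divergences of the two subgrammars (the chain rule for KL divergence). All distributions and names here are made up for illustration.

```python
import math

def kl(p, q):
    """KL divergence (in nats) between two distributions given as dicts."""
    return sum(pv * math.log(pv / q[s]) for s, pv in p.items() if pv > 0)

def mix(weights, parts):
    """Mixture of distributions; here the parts have disjoint supports."""
    out = {}
    for w, part in zip(weights, parts):
        for s, pv in part.items():
            out[s] = out.get(s, 0.0) + w * pv
    return out

# Hypothetical true grammar: S -> A (prob 0.7) | B (prob 0.3).
p_top = [0.7, 0.3]
p_A = {"a": 0.6, "aa": 0.4}
p_B = {"b": 0.5, "bb": 0.3, "bbb": 0.2}

# Hypothetical model with the same structure but different probabilities.
q_top = [0.6, 0.4]
q_A = {"a": 0.5, "aa": 0.5}
q_B = {"b": 0.4, "bb": 0.4, "bbb": 0.2}

P = mix(p_top, [p_A, p_B])
Q = mix(q_top, [q_A, q_B])

# Loss on the full language.
total = kl(P, Q)

# Top-level choice term plus weighted per-subgrammar terms.
top = kl(dict(enumerate(p_top)), dict(enumerate(q_top)))
decomposed = top + p_top[0] * kl(p_A, q_A) + p_top[1] * kl(p_B, q_B)

assert math.isclose(total, decomposed)
```

Applying the same identity inside A and B, when they themselves split into subgrammars, gives the recursive picture the abstract describes; the disjoint-support assumption is what makes this toy version exact.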

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Structure Before Collapse: Transient semantic geometry in next-token prediction
cs.LG 2026-06 unverdicted novelty 7.0

Semantic geometry emerges transiently early in next-token prediction training before collapsing to Neural Collapse symmetry in synthetic settings with latent semantic factors.

Diagnosing CFG Interpretation in LLMs
cs.AI 2026-04 unverdicted novelty 6.0

LLMs maintain surface syntax for novel CFGs but fail to preserve semantics under recursion and branching, relying on keyword bootstrapping rather than pure symbolic reasoning.

Sampling Data with Chains of Forward-Backward Diffusion Steps
cs.LG 2026-05 unverdicted novelty 5.0

U-turn chains are Markov chains formed by short forward-backward diffusion steps that remain on the learned manifold and, with Metropolis-Hastings, sample from energy-modified targets, exhibiting an ergodicity-breakin...