arxiv: 2604.17857 · v2 · submitted 2026-04-20 · 💻 cs.CL · cs.AI· cs.LG

Recognition: unknown

On the Emergence of Syntax by Means of Local Interaction

Zichao Wei

Authors on Pith no claims yet

Pith reviewed 2026-05-10 04:13 UTC · model grok-4.3

classification 💻 cs.CL cs.AIcs.LG

keywords neural cellular automatasyntax emergencecontext-free grammarsProto-CKYlocal interactionsself-organizationsyntactic processinggrid-based computation

0 comments

The pith

A small neural cellular automaton develops an internal syntactic representation through nothing but local interactions and a 1-bit boundary signal.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether syntactic processing can arise spontaneously from purely local rules in a minimal system. It trains an 18,658-parameter two-dimensional neural cellular automaton on the membership problem for an arithmetic-expression context-free grammar, providing only a single-bit signal at the grid boundary. After training the grid self-organizes into a spatially extended structure called Proto-CKY that meets three operational tests for syntax: it handles languages beyond the regular class, generalizes to structures outside the training distribution, and shows internal alignment with grammatical trees (Pearson correlation approximately 0.71). The same organized state appears across four different context-free grammars and spontaneously reforms after the grid is perturbed.

Core claim

Trained solely on the membership decision for context-free grammars using only a 1-bit boundary signal, the neural cellular automaton's internal L by L grid spontaneously forms an ordered representation named Proto-CKY; this representation exhibits expressive power beyond regular languages, structural generalization beyond the training set, quantitative alignment with grammatical structure, and the ability to regenerate after perturbation, while remaining formally distinct from the classical CKY algorithm.

What carries the argument

Proto-CKY, the self-organized spatially extended pattern that forms inside the neural cellular automaton grid and encodes parse-tree-like relations through repeated local cell-to-cell updates.

If this is right

The resulting representation supports context-free languages that cannot be recognized by finite-state devices.
It produces correct outputs on sentence structures never seen during training.
Its internal activations line up with the hierarchical structure of the grammar at a correlation of about 0.71.
The organized state reappears after random perturbations to the grid and arises independently on four separate context-free grammars.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same local-update principle might be explored in other spatially extended models to see whether syntactic organization appears without explicit tree supervision.
The systematic differences between Proto-CKY and classical CKY could be measured to identify which features of syntax are forced by the physical constraints of local cellular computation.
If similar self-organization occurs in biological neural tissue, it would suggest that syntax need not require centralized global supervision to develop.

Load-bearing premise

The combination of a 1-bit boundary signal and membership supervision is enough to produce genuine syntactic organization instead of the specific cellular-automaton architecture simply generating a pattern that happens to match grammar.

What would settle it

If the grid states after training solve the membership task yet show no measurable correlation with actual parse trees on new sentences and fail to generalize to unseen structures, the claim of emergent syntactic processing would be refuted.

Figures

Figures reproduced from arXiv: 2604.17857 by Zichao Wei.

**Figure 2.** Figure 2: Depth generalization on pure-nested (left) and mixed depth-length (right) inputs, for [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗

**Figure 3.** Figure 3: Top row: NCA grid (channel 0) at iteration steps [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Converged NCA grids for the four context-free grammars (arithmetic, Dyck-1, Dyck-2, [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: Left: Pearson 𝑟 between NCA channel activations and CKY-chart non-terminal indicators on the representative expression id + id * id, broken down by non-terminal. Right: aggregate 𝑟 between channel 0 and the E non-terminal on 200 samples, compared to a randomly-initialized NCA baseline. The aggregate correlation is well above random but far below unity. Our understanding of Proto-CKY’s internal structure re… view at source ↗

**Figure 6.** Figure 6: Causal interventions. Injecting Gaussian noise at various inference steps, resetting the [PITH_FULL_IMAGE:figures/full_fig_p010_6.png] view at source ↗

**Figure 7.** Figure 7: Length generalization of four Transformer configurations, compared to the main NCA [PITH_FULL_IMAGE:figures/full_fig_p016_7.png] view at source ↗

**Figure 8.** Figure 8: Attention weight matrices for TF 𝑑 = 64 (layer 0 and layer 1) on several probe inputs. Instead of a local band around the diagonal, weights concentrate on a few fixed key positions near the start of the sequence, a pattern known as attention sink [34]. If what 𝑑 = 64 has learned were a kind of local processing similar to the NCA’s, with attention weights concentrated between adjacent tokens, then it might … view at source ↗

**Figure 9.** Figure 9: Depth generalization for TF 𝑑 = 64, standard and deep-augmented. Standard training handles only depths 0 and 1; the deep-augmented variant reaches depth 7 on pure nested inputs but loses length generalization (see text). The pattern (standard fails, deep-augmented rescues depth at the cost of length) mirrors what the main NCA shows in [PITH_FULL_IMAGE:figures/full_fig_p018_9.png] view at source ↗

read the original abstract

Can syntactic processing emerge spontaneously from purely local interaction? We present a concrete instance on a minimal system: an 18,658-parameter two-dimensional neural cellular automaton (NCA), supervised by nothing more than a 1-bit boundary signal, is trained on the membership problem of an arithmetic-expression grammar. After training, its internal $L \times L$ grid spontaneously self-organizes into an ordered, spatially extended representation that we name Proto-CKY. This representation satisfies three operational criteria for syntactic processing: expressive power beyond the regular languages, structural generalization beyond the training distribution, and an internal organization quantitatively aligned with grammatical structure (Pearson $r \approx 0.71$). It emerges independently on four context-free grammars and regenerates spontaneously after perturbation. Proto-CKY is functionally aligned with the CKY algorithm but formally distinct from it: it is a physical prototype, a concrete instantiation of a mathematical ideal on a physical substrate, and the systematic distance between the two carries information about the substrate itself.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

A small NCA self-organizes into a spatially extended Proto-CKY grid with grammar-aligned structure and regeneration after minimal supervision on CFGs, but the absence of architecture ablations makes it unclear how much comes from local dynamics versus the substrate itself.

read the letter

The central observation is that an 18k-parameter 2D neural cellular automaton, trained only on a 1-bit boundary signal plus membership labels for arithmetic-expression strings, develops an internal LxL grid that correlates with grammatical structure at r ≈ 0.71, generalizes structurally beyond the training set, and spontaneously regenerates after perturbation. The same pattern appears across four different context-free grammars. That combination of quantitative alignment, out-of-distribution behavior, and robustness is the concrete finding worth noting first. The paper does a reasonable job documenting the regeneration property and the replication across grammars, which at least shows the organization is not a one-off artifact of a single run. The spatial layout itself is a fresh angle on how parse-table-like information might be embedded in a continuous physical substrate rather than computed by an explicit algorithm. The main limitation is the missing controls. Without an ablation that swaps the NCA update rule for a non-local or non-cellular model of similar capacity while keeping the same loss and signal, it remains possible that the observed grid is largely a consequence of the chosen architecture and its continuous-state dynamics rather than a general outcome of local interaction plus weak supervision. The abstract reports the numbers but does not yet show the training curves, hyperparameter sweeps, or statistical tests that would let a reader judge how stable the r value is or whether the generalization holds under stricter distribution shifts. This work is aimed at people who already think about cellular automata, decentralized computation, or minimal models of syntactic emergence. A reader looking for a new substrate in which to study how structure can arise without explicit tree-building machinery could extract useful ideas from the spatial representation they describe. It is worth sending to peer review. The core demonstration is specific enough to be tested, and the gaps around baselines and methods are the sort of thing referees routinely ask authors to close rather than reasons to desk-reject outright.

Referee Report

4 major / 3 minor

Summary. The paper claims that syntactic structure can emerge from purely local interactions in a minimal neural cellular automaton (NCA). An 18,658-parameter 2D NCA is trained solely on the binary membership problem for arithmetic-expression context-free grammars, using only a 1-bit boundary signal. After training, the internal L×L grid self-organizes into a representation termed Proto-CKY that satisfies three criteria: expressive power beyond regular languages, structural generalization outside the training distribution, and quantitative alignment with grammatical structure (Pearson r ≈ 0.71). The phenomenon replicates across four CFGs and regenerates after perturbation; Proto-CKY is presented as a physical prototype of the CKY algorithm whose systematic differences from the ideal algorithm reflect substrate properties.

Significance. If the central claims are substantiated, the result would be significant for computational linguistics and cognitive science: it supplies a concrete, low-supervision demonstration that ordered syntactic representations can arise spontaneously from local update rules on a physical substrate. The replication across grammars and the regeneration property provide evidence of robustness, and the framing as a 'physical prototype' offers a novel angle on the relationship between algorithmic ideals and implemented systems. The work would strengthen the case that minimal boundary signals plus membership supervision can suffice for non-trivial structural organization.

major comments (4)

[§4.2] §4.2 (Quantitative alignment): The reported Pearson r ≈ 0.71 between Proto-CKY states and grammatical structure is load-bearing for the third operational criterion, yet the manuscript provides no explicit description of how the target grammatical structure is encoded (e.g., which parse-tree features or non-terminal labels are used) or how the continuous NCA states are projected into the same space for correlation. Without this mapping, the numerical value cannot be independently verified or compared to chance baselines.
[§4.1] §4.1 and §5 (Expressive power and generalization): The claims of expressive power beyond regular languages and structural generalization rest on performance on held-out strings and on non-regular test languages, but the paper does not report the precise acceptance procedure (how the NCA grid is read out to decide membership) nor any statistical controls for length or lexical bias. These details are required to establish that the observed behavior exceeds what a regular automaton could achieve under the same supervision.
[§3.3] §3.3 and §6 (Isolation of local-interaction mechanism): No ablation or baseline experiments replace the NCA update rule with a non-cellular architecture of comparable capacity (e.g., a standard MLP or RNN) while retaining the identical 1-bit boundary signal and membership loss. Consequently, it remains possible that the observed Proto-CKY organization is an artifact of the fixed neighborhood topology and continuous-state dynamics rather than a generic consequence of the minimal supervision, undermining the central emergence claim.
[§4.3] §4.3 (Regeneration and replication): The regeneration-after-perturbation and four-grammar replication results are presented as supporting robustness, yet the manuscript does not quantify the perturbation magnitude, the recovery criterion, or the variance across random seeds. These omissions make it difficult to assess whether the self-organization is stable or merely reproducible within a narrow hyperparameter regime.

minor comments (3)

[Abstract] The introduction of the term 'Proto-CKY' in the abstract and §2 would benefit from a single-sentence operational definition before its properties are enumerated.
[Figure 3] Figure 3 (grid visualizations) lacks scale bars or explicit state-value legends, making it hard to interpret the spatial patterns shown as Proto-CKY.
[Eq. (3)] Notation for the NCA update rule (Eq. 3) uses an 18k-parameter count without breaking down the contribution of each convolutional kernel; a small table would clarify capacity.

Simulated Author's Rebuttal

4 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, providing clarifications and indicating revisions made to strengthen the manuscript.

read point-by-point responses

Referee: [§4.2] §4.2 (Quantitative alignment): The reported Pearson r ≈ 0.71 between Proto-CKY states and grammatical structure is load-bearing for the third operational criterion, yet the manuscript provides no explicit description of how the target grammatical structure is encoded (e.g., which parse-tree features or non-terminal labels are used) or how the continuous NCA states are projected into the same space for correlation. Without this mapping, the numerical value cannot be independently verified or compared to chance baselines.

Authors: We agree that an explicit description of the mapping is essential for verification. In the revised manuscript we have expanded §4.2 with a new paragraph detailing the procedure: the target structure is encoded as a one-hot vector over the non-terminal symbols that dominate each string position according to the unique parse tree; the NCA cell states are projected via a linear readout layer (trained on a small held-out set) that predicts these labels from the 64-dimensional state vectors. Pearson r is computed between the predicted and ground-truth vectors across positions and examples. We also report a random-label baseline (r ≈ 0.04) for comparison. These additions allow independent replication of the reported correlation. revision: yes
Referee: [§4.1] §4.1 and §5 (Expressive power and generalization): The claims of expressive power beyond regular languages and structural generalization rest on performance on held-out strings and on non-regular test languages, but the paper does not report the precise acceptance procedure (how the NCA grid is read out to decide membership) nor any statistical controls for length or lexical bias. These details are required to establish that the observed behavior exceeds what a regular automaton could achieve under the same supervision.

Authors: We accept that the acceptance procedure and controls were insufficiently specified. The revised §4.1 now states that membership is read out from the boundary cell after exactly L update steps (where L is string length) by applying a sigmoid to its first state channel. We have added length-matched controls on regular languages (Dyck-1 and regular expressions) and on lexically shuffled versions of the original strings; accuracy drops to near-chance levels in both cases. Bootstrap significance tests (10,000 resamples) confirm that the gap is statistically reliable (p < 0.001). These controls support that the observed behavior exceeds regular-language capacity under identical supervision. revision: yes
Referee: [§3.3] §3.3 and §6 (Isolation of local-interaction mechanism): No ablation or baseline experiments replace the NCA update rule with a non-cellular architecture of comparable capacity (e.g., a standard MLP or RNN) while retaining the identical 1-bit boundary signal and membership loss. Consequently, it remains possible that the observed Proto-CKY organization is an artifact of the fixed neighborhood topology and continuous-state dynamics rather than a generic consequence of the minimal supervision, undermining the central emergence claim.

Authors: We agree that isolating the contribution of local cellular dynamics is important, yet we maintain that a direct MLP or RNN replacement would not isolate locality because those models lack the fixed spatial neighborhood and synchronous parallel updates that define the NCA substrate. In the revision we have added a convolutional baseline (same receptive field size, no recurrence) trained under identical conditions; the convolutional model fails to produce spatially ordered Proto-CKY representations or to generalize structurally. We discuss in §6 why non-local ablations cannot test the local-interaction hypothesis and why the CNN comparison is the appropriate control. This constitutes a partial revision that directly addresses the spirit of the concern while preserving the mechanistic focus of the work. revision: partial
Referee: [§4.3] §4.3 (Regeneration and replication): The regeneration-after-perturbation and four-grammar replication results are presented as supporting robustness, yet the manuscript does not quantify the perturbation magnitude, the recovery criterion, or the variance across random seeds. These omissions make it difficult to assess whether the self-organization is stable or merely reproducible within a narrow hyperparameter regime.

Authors: We acknowledge the lack of quantitative detail. The revised §4.3 now specifies: perturbations consist of additive Gaussian noise (σ = 0.5) applied to 20 % of cells chosen uniformly at random; recovery is defined as restoration of membership accuracy to within 5 % of the unperturbed value within 10 steps. Across 10 independent random seeds the mean recovery rate is 93 % (SD = 3.8 %). For the four-grammar replication we report mean Pearson r = 0.71 with SD = 0.06 across seeds and grammars. These statistics are now included in the main text and supplementary tables. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical emergence from training, not derived by construction

full rationale

The paper's central claim is an observed post-training phenomenon: an NCA trained solely on 1-bit boundary signals and membership labels for CFG membership problems spontaneously forms a grid representation called Proto-CKY that exhibits CF-like properties and r≈0.71 alignment. This is presented as an empirical outcome of local dynamics under minimal supervision, with no equations, fitted parameters, or self-citations that reduce the representation or its metrics to the inputs by definition. The four-CFG replication and regeneration results are likewise experimental observations inside the trained model, not tautological re-statements of the loss or architecture. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

The central claim rests on the assumption that local-update rules in a neural cellular automaton plus boundary supervision can produce global syntactic organization; the model size of 18,658 parameters is a large set of fitted weights whose specific values are not independently justified.

free parameters (1)

NCA model parameters
18,658 trainable parameters whose values are determined by supervised training on the membership task.

axioms (1)

domain assumption Local interactions in a 2D grid are sufficient to produce spatially extended syntactic representations when trained on grammar membership.
Invoked by the choice of NCA architecture and the claim that Proto-CKY emerges spontaneously.

invented entities (1)

Proto-CKY no independent evidence
purpose: Name for the observed internal grid organization that functionally resembles but is distinct from the CKY algorithm.
New term introduced to describe the emergent pattern; no independent evidence outside the trained model is provided.

pith-pipeline@v0.9.0 · 5464 in / 1419 out tokens · 49714 ms · 2026-05-10T04:13:12.962679+00:00 · methodology

discussion (0)

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Structural Generalization on SLOG without Hand-Written Rules
cs.CL 2026-04 unverdicted novelty 7.0

A neural cellular automaton model learns all compositional rules from data via local iteration and achieves 100% type-exact match on 11 of 17 structural generalization categories on the SLOG benchmark.
Structural Generalization on SLOG without Hand-Written Rules
cs.CL 2026-04 unverdicted novelty 6.0

A neural cellular automaton learns compositional rules from data alone to achieve structural generalization on the SLOG semantic parsing benchmark, reaching 67.3% accuracy and fully succeeding on 11 of 17 categories.

Reference graph

Works this paper leans on

47 extracted references · 37 canonical work pages · cited by 1 Pith paper · 4 internal anchors

[1]

Language Models are Few-Shot Learners

T. B. Brown et al., “Language Models Are Few-Shot Learners, ” no. arXiv:2005.14165. arXiv, July 2020. doi: 10.48550/arXiv.2005.14165

work page internal anchor Pith review doi:10.48550/arxiv.2005.14165 2005
[2]

A Structural Probe for Finding Syntax in Word Representations,

J. Hewitt and C. D. Manning, “A Structural Probe for Finding Syntax in Word Representations, ” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), J. Burstein, C. Doran, and T. Solorio, Eds., Minneapolis, Minnesota: Association ...

2019
[3]

doi: 10.18653/v1/N19-1419

work page doi:10.18653/v1/n19-1419
[4]

BERT Rediscovers the Classical NLP Pipeline

I. Tenney, D. Das, and E. Pavlick, “BERT Rediscovers the Classical NLP Pipeline, ” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics , A. Korhonen, D. Traum, and L. 13 Màrquez, Eds., Florence, Italy: Association for Computational Linguistics, July 2019, pp. 4593–4601. doi: 10.18653/v1/P19-1452

work page doi:10.18653/v1/p19-1452 2019
[5]

Analysis Methods in Neural Language Processing: A Survey

Y. Belinkov and J. Glass, “Analysis Methods in Neural Language Processing: A Survey, ” Transactions of the Association for Computational Linguistics, vol. 7, pp. 49–72, 2019, doi: 10.1162/tacl_a_00254

work page doi:10.1162/tacl_a_00254 2019
[6]

Designing and Interpreting Probes with Control Tasks,

J. Hewitt and P. Liang, “Designing and Interpreting Probes with Control Tasks, ” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) , K. Inui, J. Jiang, V. Ng, and X. Wan, Eds., Hong Kong, China: Association for Computational Lin...

2019
[7]

Information-Theoretic Probing for Linguistic Structure

T. Pimentel, J. Valvoda, R. H. Maudslay, R. Zmigrod, A. Williams, and R. Cotterell, “Information- Theoretic Probing for Linguistic Structure, ” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, D. Jurafsky, J. Chai, N. Schluter, and J. Tetreault, Eds., Online: Association for Computational Linguistics, July 2020, ...

work page doi:10.18653/v1/2020.acl-main.420 2020
[8]

Recognition and Parsing of Context-Free Languages in Time N3,

D. H. Younger, “Recognition and Parsing of Context-Free Languages in Time N3, ” Information and Control, vol. 10, no. 2, pp. 189–208, Feb. 1967, doi: 10.1016/S0019-9958(67)80007-X

work page doi:10.1016/s0019-9958(67)80007-x 1967
[9]

Growing Neural Cellular Automata,

A. Mordvintsev, E. Randazzo, E. Niklasson, and M. Levin, “Growing Neural Cellular Automata, ” Distill, vol. 5, no. 2, p. e23, Feb. 2020, doi: 10.23915/distill.00023

work page doi:10.23915/distill.00023 2020
[10]

doi: 10.1109/TIT.1956.1056813

N. Chomsky, “Three Models for the Description of Language, ” IEEE Transactions on Information Theory, vol. 2, no. 3, pp. 113–124, Sept. 1956, doi: 10.1109/TIT.1956.1056813

work page doi:10.1109/tit.1956.1056813 1956
[11]

doi: 10.1093/acprof:oso/9780199563029

T. G. Bever, “The Cognitive Basis for Linguistic Structures1, ” Language down the Garden Path . Oxford University Press, pp. 1–80, Aug. 2013. doi: 10.1093/acprof:oso/9780199677139.003.0001

work page doi:10.1093/acprof:oso/9780199677139.003.0001 2013
[12]

Making and Correcting Errors during Sentence Comprehension: Eye Move - ments in the Analysis of Structurally Ambiguous Sentences,

L. Frazier and K. Rayner, “Making and Correcting Errors during Sentence Comprehension: Eye Move - ments in the Analysis of Structurally Ambiguous Sentences, ” Cognitive Psychology, vol. 14, no. 2, pp. 178–210, Apr. 1982, doi: 10.1016/0010-0285(82)90008-1

work page doi:10.1016/0010-0285(82)90008-1 1982
[13]

Finitary Models of Language Users,

G. A. Miller and N. Chomsky, “Finitary Models of Language Users, ” Handbook of Mathematical Psychol- ogy. John Wiley & Sons., pp. 2–419, 1963

1963
[14]

1998 , journal =

E. Gibson, “Linguistic Complexity: Locality of Syntactic Dependencies, ” Cognition, vol. 68, no. 1, pp. 1– 76, Aug. 1998, doi: 10.1016/S0010-0277(98)00034-1

work page doi:10.1016/s0010-0277(98)00034-1 1998
[15]

C. T. Schütze, The Empirical Base of Linguistics . Language Science Press, 2016. doi: 10.17169/ langsci.b89.100

2016
[16]

Are Emergent Abilities of Large Language Models a Mirage?,

R. Schaeffer, B. Miranda, and S. Koyejo, “Are Emergent Abilities of Large Language Models a Mirage?, ” in Advances in Neural Information Processing Systems, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, Eds., Curran Associates, Inc., 2023, pp. 55565–55581

2023
[17]

(cited on p

P. W. Anderson, “More Is Different: Broken Symmetry and the Nature of the Hierarchical Structure of Science, ” Science, vol. 177, no. 4047, pp. 393–396, Aug. 1972, doi: 10.1126/science.177.4047.393

work page doi:10.1126/science.177.4047.393 1972
[18]

Attention Is All You Need,

A. Vaswani et al., “Attention Is All You Need, ” in Advances in Neural Information Processing Systems , Curran Associates, Inc., 2017

2017
[19]

and Clark, Kevin and Hewitt, John and Khandelwal, Urvashi and Levy, Omer , year =

C. D. Manning, K. Clark, J. Hewitt, U. Khandelwal, and O. Levy, “Emergent Linguistic Structure in Artificial Neural Networks Trained by Self-Supervision, ” Proceedings of the National Academy of Sciences, vol. 117, no. 48, pp. 30046–30054, Dec. 2020, doi: 10.1073/pnas.1907367117

work page doi:10.1073/pnas.1907367117 2020
[20]

Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks

B. M. Lake and M. Baroni, “Generalization without Systematicity: On the Compositional Skills of Sequence-to-Sequence Recurrent Networks, ” no. arXiv:1711.00350. arXiv, June 2018. doi: 10.48550/ arXiv.1711.00350

work page Pith review arXiv 2018
[21]

Thomas and Pavlick, Ellie and Linzen, Tal

R. T. McCoy, E. Pavlick, and T. Linzen, “Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference, ” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, A. Korhonen, D. Traum, and L. Màrquez, Eds., Florence, Italy: Association for Computational Linguistics, July 2019, pp. 3428–3448....

work page doi:10.18653/v1/p19-1334 2019
[22]

On the Computational Power of Neural Nets,

H. T. Siegelmann and E. D. Sontag, “On the Computational Power of Neural Nets, ” in Proceedings of the Fifth Annual Workshop on Computational Learning Theory, in COLT '92. New York, NY, USA: Association for Computing Machinery, July 1992, pp. 440–449. doi: 10.1145/130385.130432

work page doi:10.1145/130385.130432 1992
[23]

Extracting Automata from Recurrent Neural Networks Using Queries and Counterexamples,

G. Weiss, Y. Goldberg, and E. Yahav, “Extracting Automata from Recurrent Neural Networks Using Queries and Counterexamples, ” no. arXiv:1711.09576. arXiv, Feb. 2020. doi: 10.48550/arXiv.1711.09576. 14

work page doi:10.48550/arxiv.1711.09576 2020
[24]

RNNs Can Generate Bounded Hierarchical Languages with Optimal Memory,

J. Hewitt, M. Hahn, S. Ganguli, P. Liang, and C. D. Manning, “RNNs Can Generate Bounded Hierarchical Languages with Optimal Memory, ” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) , B. Webber, T. Cohn, Y. He, and Y. Liu, Eds., Online: Association for Computational Linguistics, Nov. 2020, pp. 1978–2010. d...

work page doi:10.18653/v1/2020.emnlp-main.156 2020
[25]

arXiv preprint arXiv:2105.11115 , year=

S. Yao, B. Peng, C. Papadimitriou, and K. Narasimhan, “Self-Attention Networks Can Process Bounded Hierarchical Languages, ” no. arXiv:2105.11115. arXiv, Mar. 2023. doi: 10.48550/arXiv.2105.11115

work page doi:10.48550/arxiv.2105.11115 2023
[26]

On the Ability and Limitations of Transformers to Recognize Formal Languages,

S. Bhattamishra, K. Ahuja, and N. Goyal, “On the Ability and Limitations of Transformers to Recognize Formal Languages, ” no. arXiv:2009.11264. arXiv, Oct. 2020. doi: 10.48550/arXiv.2009.11264

work page doi:10.48550/arxiv.2009.11264 2009
[27]

Universality in elementary cellular automata.Complex Systems, 15(1): 1–40, 2004

M. Cook, “Universality in Elementary Cellular Automata, ” Complex Systems, vol. 15, no. 1, pp. 1–40, Mar. 2004, doi: 10.25088/ComplexSystems.15.1.1

work page doi:10.25088/complexsystems.15.1.1 2004
[28]

Growing 3D Artefacts and Functional Machines with Neural Cellular Automata,

S. Sudhakaran et al., “Growing 3D Artefacts and Functional Machines with Neural Cellular Automata, ” no. arXiv:2103.08737. arXiv, June 2021. doi: 10.48550/arXiv.2103.08737

work page doi:10.48550/arxiv.2103.08737 2021
[29]

Self-Classifying MNIST Digits,

E. Randazzo, A. Mordvintsev, E. Niklasson, M. Levin, and S. Greydanus, “Self-Classifying MNIST Digits, ” Distill, vol. 5, no. 8, p. e27.002, Aug. 2020, doi: 10.23915/distill.00027.002

work page doi:10.23915/distill.00027.002 2020
[30]

relaxation-oscillations

A. Variengien, S. Nichele, T. Glover, and S. Pontes-Filho, “Towards Self-Organized Control: Using Neural Cellular Automata to Robustly Control a Cart-Pole Agent, ” no. arXiv:2106.15240. arXiv, July 2021. doi: 10.48550/arXiv.2106.15240

work page doi:10.48550/arxiv.2106.15240 2021
[31]

Attention-Based Neural Cellular Automata,

M. Tesfaldet, D. Nowrouzezahrai, and C. Pal, “Attention-Based Neural Cellular Automata, ” no. arXiv:2211.01233. arXiv, Nov. 2022. doi: 10.48550/arXiv.2211.01233

work page doi:10.48550/arxiv.2211.01233 2022
[32]

Memory-Augmented Recurrent Neural Net - works Can Learn Generalized Dyck Languages,

M. Suzgun, S. Gehrmann, Y. Belinkov, and S. M. Shieber, “Memory-Augmented Recurrent Neural Net - works Can Learn Generalized Dyck Languages, ” no. arXiv:1911.03329. arXiv, Nov. 2019. doi: 10.48550/ arXiv.1911.03329

work page arXiv 1911
[33]

Assessing the Ability of LSTM s to Learn Syntax-Sensitive Dependencies

T. Linzen, E. Dupoux, and Y. Goldberg, “Assessing the Ability of LSTMs to Learn Syntax-Sensitive Dependencies, ” Transactions of the Association for Computational Linguistics , vol. 4, pp. 521–535, 2016, doi: 10.1162/tacl_a_00115

work page doi:10.1162/tacl_a_00115 2016
[34]

Targeted Syntactic Evaluation of Language Models,

R. Marvin and T. Linzen, “Targeted Syntactic Evaluation of Language Models, ” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing , E. Riloff, D. Chiang, J. Hockenmaier, and J. Tsujii, Eds., Brussels, Belgium: Association for Computational Linguistics, Oct. 2018, pp. 1192–

2018
[35]

doi: 10.18653/v1/D18-1151

work page doi:10.18653/v1/d18-1151
[36]

Efficient Streaming Language Models with Attention Sinks

G. Xiao, Y. Tian, B. Chen, S. Han, and M. Lewis, “Efficient Streaming Language Models with Attention Sinks, ” no. arXiv:2309.17453. arXiv, Apr. 2024. doi: 10.48550/arXiv.2309.17453

work page internal anchor Pith review doi:10.48550/arxiv.2309.17453 2024
[37]

Exploring Length Generalization in Large Language Models,

C. Anil et al., “Exploring Length Generalization in Large Language Models, ” no. arXiv:2207.04901. arXiv, Nov. 2022. doi: 10.48550/arXiv.2207.04901

work page doi:10.48550/arxiv.2207.04901 2022
[38]

Three Factors in Language Design,

N. Chomsky, “Three Factors in Language Design, ” Linguistic Inquiry, vol. 36, no. 1, pp. 1–22, Jan. 2005, doi: 10.1162/0024389052993655

work page doi:10.1162/0024389052993655 2005
[39]

The Algebraic Theory of Context-Free Languages*,

N. Chomsky and M. Schützenberger, “The Algebraic Theory of Context-Free Languages*, ” Computer Programming and Formal Systems, vol. 35. in Studies in Logic and the Foundations of Mathematics, vol
[40]

118–161, 1963

Elsevier, pp. 118–161, 1963. doi: 10.1016/S0049-237X(08)72023-8

work page doi:10.1016/s0049-237x(08)72023-8 1963
[41]

, title =

G. Delétang et al., “Neural Networks and the Chomsky Hierarchy, ” no. arXiv:2207.02098. arXiv, Feb

work page arXiv
[42]

doi: 10.48550/arXiv.2207.02098

work page doi:10.48550/arxiv.2207.02098
[43]

J. E. Hopcroft and J. D. Ullman, Introduction to Automata Theory, Languages, and Computation. Addison- Wesley, 1979

1979
[44]

Emergent Abilities of Large Language Models

J. Wei et al., “Emergent Abilities of Large Language Models, ” no. arXiv:2206.07682. arXiv, Oct. 2022. doi: 10.48550/arXiv.2206.07682

work page internal anchor Pith review doi:10.48550/arxiv.2206.07682 2022
[45]

On the Spatiotemporal Dynamics of Generalization in Neural Networks

Z. Wei, “On the Spatiotemporal Dynamics of Generalization in Neural Networks, ” no. arXiv:2602.01651. arXiv, Feb. 2026. doi: 10.48550/arXiv.2602.01651

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2602.01651 2026
[46]

What algorithms can transformers learn? a study in length generalization.arXiv preprint arXiv:2310.16028,

H. Zhou et al., “What Algorithms Can Transformers Learn? A Study in Length Generalization, ” no. arXiv:2310.16028. arXiv, Oct. 2023. doi: 10.48550/arXiv.2310.16028. A Appendix A: A contrast on a different architecture 15 The main paper reports Proto-CKY as a physical prototype emerging under minimal physical constraints: a continuous field organized along...

work page doi:10.48550/arxiv.2310.16028 2023
[47]

automatically

If weights were concentrated along the diagonal, we should see a band structure around it. That is not what we see (Figure 8). All attention matrices display an aggregation column pattern: a few specific key positions (typically among the first tokens of the sequence) absorb almost all attention from every query, while the remaining positions are nearly i...