pith. machine review for the scientific record.

arxiv: 2605.07120 · v1 · submitted 2026-05-08 · 💻 cs.LG · stat.ML

Recognition: no theorem link

When Symbol Names Should Not Matter: A Logistic Theory of Fresh-Symbol Classification

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 01:25 UTC · model grok-4.3

classification 💻 cs.LG stat.ML
keywords: template classification · fresh-symbol generalization · collision graph · margin transfer · kernel logistic regression · symbol invariance · transformer approximation

The pith

In template classification, logistic predictors decompose into ideal symbol-invariant rules plus perturbations from token overlaps modeled by a colored collision graph.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies fixed-label template tasks where models must classify using shared latent templates despite disjoint train and test vocabularies. It focuses on regularized kernel logistic classification under the transformer-kernel approximation. The key decomposition separates the learned predictor into an ideal template-level classifier and a finite-sample term driven by accidental overlaps in the training set. These overlaps are captured by a colored collision graph whose structure determines whether the ideal margin survives symbol renaming. High-probability guarantees follow when graph geometry preserves separation, refining earlier conditions that treated diversity only through vocabulary size.
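The collision-graph construction described above can be made concrete with a small illustrative sketch (ours, not the paper's code; the slot names and dict encoding are hypothetical): nodes are training examples, an edge appears only for a non-fresh pair, and each edge is colored by the slots and token through which the collision occurs.

```python
# Illustrative sketch (not the paper's implementation): build a "colored
# collision graph" over training examples. Each example maps wildcard slots
# to concrete tokens; an edge joins two examples exactly when they reuse a
# token in wildcard positions (a non-fresh pair), colored by the slot pair
# and the shared token.
from itertools import combinations

def collision_graph(examples):
    """examples: list of dicts mapping wildcard slot -> token.
    Returns {(i, j): set of (slot_i, slot_j, token)} for colliding pairs."""
    edges = {}
    for (i, xi), (j, xj) in combinations(enumerate(examples), 2):
        colors = {(si, sj, t)
                  for si, t in xi.items()
                  for sj, u in xj.items() if t == u}
        if colors:  # fresh pairs get no edge at all
            edges[(i, j)] = colors
    return edges

train = [
    {"s1": "a", "s2": "b"},   # example 0
    {"s1": "c", "s2": "b"},   # example 1: collides with 0 through token "b"
    {"s1": "d", "s2": "e"},   # example 2: fresh with respect to both
]
g = collision_graph(train)
# g has a single edge (0, 1), colored by the shared wildcard token "b"
```

A fresh pair such as (0, 2) contributes no edge, matching the zero-Gram-discrepancy case the paper's Figure 2 depicts.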

Core claim

The learned predictor decomposes into an ideal template-level classifier and a finite-sample perturbation caused by accidental token overlaps in the training data. Overlaps are encoded by a colored collision graph, and high-probability margin-transfer guarantees are proved for fresh-symbol classification. Vocabulary size controls the average rate of collisions, but collision geometry controls whether the ideal classification margin is preserved.

What carries the argument

Colored collision graph encoding token overlaps whose geometry determines margin preservation under symbol renaming.
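To see why geometry, not just rate, matters, here is a hedged sketch (our illustration; the paper's Theorem 3.7 certificates are richer and include a spectral-norm bound): two collision graphs with identical edge and color counts but very different maximum degree, the simplest geometry statistic the figures compare.

```python
# Hedged sketch: simple geometry statistics on a collision graph, standing in
# for the paper's certificates (e.g. d_max, color imbalance). The
# spectral-norm certificate would need linear algebra and is omitted.
from collections import Counter

def graph_stats(edges):
    """edges: dict {(i, j): color} for colliding pairs.
    Returns (maximum degree, per-color edge counts)."""
    deg = Counter()
    for (i, j) in edges:
        deg[i] += 1
        deg[j] += 1
    return max(deg.values(), default=0), Counter(edges.values())

# Two graphs with the same edge count and colors but different geometry:
# a star concentrates collisions on one example, a matching spreads them.
star     = {(0, k): "b" for k in range(1, 5)}
matching = {(0, 1): "b", (2, 3): "b", (4, 5): "b", (6, 7): "b"}
d_star, _  = graph_stats(star)       # d_max = 4
d_match, _ = graph_stats(matching)   # d_max = 1
```

Both graphs have four edges of one color, so any scalar rate proxy treats them alike; only the degree statistic separates them.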

If this is right

  • The same perturbation analysis applies to abstraction-augmented inputs and yields a margin-versus-collision test for prompting strategies.
  • Synthetic experiments confirm that regularization strength, sample size, and kernel structure affect fresh-symbol performance exactly as predicted by the graph geometry.
  • Scalar diversity conditions are replaced by a joint rate-and-geometry criterion for when symbol names cease to matter.
  • The framework supplies explicit high-probability bounds that quantify how much training overlap can be tolerated before the ideal rule is lost.
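As a toy version of the fresh-symbol setup (our sketch, not the paper's experiments; the token-equality feature stands in for a symbol-invariant kernel), regularized logistic regression trained on one vocabulary transfers perfectly to a disjoint one, because its feature never sees token names:

```python
# Minimal fresh-symbol experiment (a sketch under simplifying assumptions):
# a binary same/different template task, learned by ridge-regularized
# logistic regression on a symbol-invariant feature (token equality).
import math, itertools

def featurize(x):  # symbol-invariant feature: do the two slots match?
    return [1.0 if x[0] == x[1] else 0.0, 1.0]   # equality indicator + bias

def fit_logistic(data, lam=0.01, lr=0.5, steps=500):
    """Plain gradient descent on the regularized logistic loss."""
    w = [0.0, 0.0]
    for _ in range(steps):
        grad = [lam * wi for wi in w]            # ridge penalty
        for x, y in data:
            f = featurize(x)
            s = sum(wi * fi for wi, fi in zip(w, f))
            p = 1.0 / (1.0 + math.exp(-y * s))   # sigmoid(y * score)
            for k in range(2):
                grad[k] -= (1.0 - p) * y * f[k] / len(data)
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w

make = lambda V: [((u, v), 1 if u == v else -1)
                  for u, v in itertools.product(V, repeat=2)]
w = fit_logistic(make("abc"))                    # train vocabulary {a, b, c}
test = make("xyz")                               # disjoint test vocabulary
acc = sum((sum(wi * fi for wi, fi in zip(w, featurize(x))) > 0) == (y > 0)
          for x, y in test) / len(test)
# acc = 1.0: the rule never depended on token names
```

With a kernel that is only approximately symbol-invariant, training collisions would perturb the learned weights, which is the regime the paper's margin-transfer bounds quantify.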

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training sets could be actively chosen to avoid collision geometries that destroy margins, improving generalization without larger vocabularies.
  • If real transformers depart from the kernel regime, the guarantees may fail in ways that suggest new inductive biases are needed.
  • The collision-graph view might extend to non-kernel models by measuring effective overlaps in their internal representations.
  • Prompt engineering could be scored by computing the implied collision graph on augmented inputs and checking margin preservation.

Load-bearing premise

The transformer must behave like the kernel model and the overlaps must follow the graph structure that keeps the ideal margin intact.

What would settle it

A trained transformer whose decision boundary deviates substantially from the ideal template classifier even when the collision graph is favorable would falsify the margin-transfer guarantees.

Figures

Figures reproduced from arXiv: 2605.07120 by Jelena Bradic, Wenjie Guan.

Figure 1: Pairwise token reuse preserves ρ and edge-token counts, but collision structure differs in maximum degree d_max, color imbalance, and centered spectral norm ∥C∥op. Vocabulary diversity ρ, as defined in [9], is only a marginal proxy: it controls the largest marginal probability of any token occupying a wildcard slot, but not how collisions are distributed.

Figure 2: What the collision graph records. A graph edge appears only for a non-fresh pair. Here (i, j) collides through the wildcard token b, while (i, k) is fresh and has zero Gram discrepancy.

Figure 3: Collision-graph bounds. Eight explicit finite template tasks are generated by wildcard substitutions and literal hits, after which the colored collision graph and all five Theorem 3.7 certificates are computed. Panel (a) compares the five graph certificates to the scalar proxy Bρ = wmax/ρ. Panel (b) decomposes the selected route. Panel (c) plots log10(Bρ/B♯λ), with positive bars indicating an improvement…

Figure 4: Test accuracy of one-layer transformers with different K-Q and V-O multipliers. Left: accuracy for binary classification. Right: accuracy for finding the majority.

Figure 5: Test error curves for one-layer transformers with different embedding dimensions for binary classification.

Figure 6: Generated colored collision graphs for the eight template tasks, showing how different wildcard and literal substitutions induce distinct graph geometries.

Figure 7: Binary same/different task. Architecture comparison under the regime of Sec. K.4.1. Left: vanilla transformers. Right: tied-embedding/unembedding transformers.

Figure 8: Binary task: width sweep. Test error vs. n for several embedding dimensions d and different KQ and VO multipliers, for vanilla transformers (Sec. K.4.2).

Figure 9: Binary task: width sweep. Test error vs. n for several embedding dimensions d and different KQ and VO multipliers, for tied-embedding/unembedding transformers (Sec. K.4.2).

Figure 10: Four-class relational task. Architecture comparison. Left: vanilla transformers. Right: tied-embedding/unembedding transformers.

Figure 11: Multiclass task: width sweep. Test error vs. n for several embedding dimensions d and different KQ and VO multipliers, for vanilla transformers (Sec. K.4.2).

Figure 12: Multiclass task: width sweep. Test error vs. n for several embedding dimensions d and different KQ and VO multipliers, for tied-embedding/unembedding transformers (Sec. K.4.2).

Figure 13: Majority / copy-wildcard task. Architecture comparison. Left: vanilla transformers. Right: tied-embedding/unembedding transformers.

Figure 14: Majority task: width sweep, for vanilla transformers.

Figure 15: Majority task: width sweep, for tied-embedding/unembedding transformers.

Figure 16: Variable-assignment / print task. Test error vs. number of training samples for the vanilla transformer and the three identity-multiplier variants, at d = 1024, D = 2 (Sec. K.4.2).
Original abstract

Template tasks have emerged as a clean testbed for asking whether transformers reason with abstract symbols rather than concrete token names. We study the fixed-label classification version of this problem, where train and test examples share latent templates but may use disjoint vocabularies. Unlike next-token prediction, the model need not emit unseen symbols; it must learn a decision rule invariant to symbol renaming. We analyze regularized kernel logistic classification in the transformer-kernel regime. Our main result decomposes the learned predictor into an ideal template-level classifier and a finite-sample perturbation caused by accidental token overlaps in the training data. We encode these overlaps by a colored collision graph and prove high-probability margin-transfer guarantees for fresh-symbol classification. This perspective extends template-based analyses to logistic classification and refines scalar diversity conditions: vocabulary size controls the average rate of collisions, but collision geometry controls whether the ideal classification margin is preserved. More broadly, the same perturbation framework applies to abstraction-augmented inputs, yielding a general margin-versus-collision criterion for identifying when prompting strategies improve fresh-symbol generalization. Synthetic template experiments illustrate the predicted roles of regularization, sample size, and transformer-kernel structure.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper analyzes regularized kernel logistic classification in the transformer-kernel regime for fixed-label template classification tasks, where train and test share latent templates but may use disjoint vocabularies. The central result decomposes the learned predictor into an ideal template-level classifier plus a finite-sample perturbation from accidental token overlaps, encoded via a colored collision graph. High-probability margin-transfer guarantees are proved for fresh-symbol classification. The work refines scalar diversity conditions by separating vocabulary size (average collision rate) from collision geometry (margin preservation), extends the framework to abstraction-augmented inputs, and illustrates predictions via synthetic experiments on regularization, sample size, and kernel structure.

Significance. If the decomposition and guarantees hold under the stated assumptions, the paper supplies a principled perturbation analysis that clarifies when invariance to symbol renaming emerges in logistic settings. It makes the distinction between average collision rate and collision geometry concrete, and the margin-versus-collision criterion for prompting strategies is a useful generalization. The synthetic experiments directly test the predicted roles of regularization and sample size, providing reproducible support for the theory. This framework could inform both theoretical work on transformer abstraction and practical choices in prompt design.

major comments (2)
  1. [§3] §3 (Transformer-kernel regime): The high-probability margin-transfer guarantees are derived under the assumption that the transformer behaves as a kernel method, yet no quantitative bounds are supplied on the approximation error to actual finite-width attention or optimization dynamics; this is load-bearing for the claim that the decomposition applies to transformers rather than only to the idealized kernel model.
  2. [Definition 2.3 and Theorem 3.1] Colored collision graph construction (Definition 2.3 and Theorem 3.1): The graph encodes overlaps to control the perturbation term, but the manuscript does not demonstrate that the graph construction is independent of label information or that its geometry is preserved under the data distribution; without this, the separation between vocabulary size and margin preservation rests on an unverified modeling choice.
minor comments (2)
  1. [§5] The synthetic experiments in §5 are well-aligned with the theory but would benefit from an explicit statement of the kernel hyperparameters used and a brief comparison to a non-kernel baseline to isolate the regime effect.
  2. [Eq. (8)] Notation for the perturbation term (Eq. (8)) could be cross-referenced more clearly when it reappears in the margin bound statements.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and the positive assessment of the significance of our work. We address the two major comments below, providing clarifications and indicating the revisions we will make to the manuscript.

Point-by-point responses
  1. Referee: [§3] §3 (Transformer-kernel regime): The high-probability margin-transfer guarantees are derived under the assumption that the transformer behaves as a kernel method, yet no quantitative bounds are supplied on the approximation error to actual finite-width attention or optimization dynamics; this is load-bearing for the claim that the decomposition applies to transformers rather than only to the idealized kernel model.

    Authors: We agree that our results are derived under the transformer-kernel regime assumption. The manuscript does not provide quantitative bounds on the approximation error between the kernel model and finite-width attention mechanisms or specific optimization paths, as the focus is on the idealized setting. This is a modeling choice to enable the decomposition analysis. We will revise the discussion in §3 to more explicitly delineate the scope of the claims, noting that extensions to finite transformers would require additional approximation theory, which we leave for future work. This addresses the load-bearing aspect by clarifying the idealized nature of the model. revision: yes

  2. Referee: [Definition 2.3 and Theorem 3.1] Colored collision graph construction (Definition 2.3 and Theorem 3.1): The graph encodes overlaps to control the perturbation term, but the manuscript does not demonstrate that the graph construction is independent of label information or that its geometry is preserved under the data distribution; without this, the separation between vocabulary size and margin preservation rests on an unverified modeling choice.

    Authors: The construction of the colored collision graph in Definition 2.3 is based exclusively on the observed token overlaps in the training set inputs and does not incorporate label information; labels are assigned at the template level independently of symbol identities. The geometry of the graph is analyzed probabilistically in Theorem 3.1, where we show that under the data distribution, with high probability the perturbation is controlled when the average collision rate (governed by vocabulary size) and the margin conditions (governed by geometry) are satisfied. The separation is thus a consequence of the theorem rather than an unverified choice. To make this explicit, we will add a clarifying paragraph after Definition 2.3 explaining the label-independence and the role of the data distribution in preserving the relevant geometric properties. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained under stated assumptions

Full rationale

The paper derives a decomposition of the learned predictor into an ideal template classifier plus perturbation from overlaps (modeled via colored collision graph) and proves margin-transfer guarantees, all explicitly conditioned on the transformer-kernel regime. This is a standard theoretical proof structure with no reduction of the central claim to its inputs by construction, no fitted parameters renamed as predictions, and no load-bearing self-citations or ansatzes smuggled via prior work. The analysis remains independent within its modeling framework and does not equate the result to the assumptions themselves.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 1 invented entity

The central claim rests on the transformer-kernel regime approximation and standard high-probability concentration tools for the margin guarantees. The colored collision graph is introduced as a modeling device for overlaps. No explicit free parameters are described as fitted to target data beyond standard regularization.

free parameters (1)
  • regularization strength
    Standard hyperparameter in logistic classification that trades off fit and complexity; not described as fitted specifically to the margin result.
axioms (2)
  • domain assumption The model operates in the transformer-kernel regime
    The entire analysis is conducted under this approximation as stated.
  • standard math High-probability concentration inequalities apply to margin transfer
    Invoked to establish the guarantees for fresh-symbol classification.
invented entities (1)
  • colored collision graph no independent evidence
    purpose: To encode and analyze accidental token overlaps in training data for the perturbation term
    New modeling construct central to decomposing the predictor and controlling margin preservation.

pith-pipeline@v0.9.0 · 5498 in / 1572 out tokens · 51724 ms · 2026-05-11T01:25:30.548775+00:00 · methodology


Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages · 1 internal anchor

  1. [1]

    Generalization on the Unseen, Logic Reasoning and Degree Curriculum.Journal of Machine Learning Research, 25(331):1–58, 2024

    Emmanuel Abbe, Samy Bengio, Aryo Lotfi, and Kevin Rizk. Generalization on the Unseen, Logic Reasoning and Degree Curriculum.Journal of Machine Learning Research, 25(331):1–58, 2024

  2. [2]

    Emergence of Symbolic Abstraction Heads for In-Context Learning in Large Language Models

    Ali Al-Saeedi and Aki Härmä. Emergence of Symbolic Abstraction Heads for In-Context Learning in Large Language Models. InProceedings of Bridging Neurons and Symbols for Natural Language Processing and Knowledge Graphs Reasoning @ COLING 2025, pages 86–96, Association for Computational Linguistics, 2025

  3. [3]

    Lepori, Jack Merullo, and Ellie Pavlick

    Suraj Anand, Michael A. Lepori, Jack Merullo, and Ellie Pavlick. Dual Process Learning: Controlling Use of In- Context vs. In-Weights Strategies with Weight Forgetting. InInternational Conference on Learning Representations, 2025

  4. [4]

    Du, Wei Hu, Zhiyuan Li, Ruslan Salakhutdinov, and Ruosong Wang

    Sanjeev Arora, Simon S. Du, Wei Hu, Zhiyuan Li, Ruslan Salakhutdinov, and Ruosong Wang. On Exact Computation with an Infinitely Wide Neural Net. InAdvances in Neural Information Processing Systems, volume 32, 2019

  5. [5]

    Self-Concordant Analysis for Logistic Regression.Electronic Journal of Statistics, 4:384–414, 2010

    Francis Bach. Self-Concordant Analysis for Logistic Regression.Electronic Journal of Statistics, 4:384–414, 2010

  6. [6]

    Neural Machine Translation by Jointly Learning to Align and Translate

    Dzmitry Bahdanau, KyungHyun Cho, and Yoshua Bengio. Neural Machine Translation by Jointly Learning to Align and Translate. InProceedings of the International Conference on Learning Representations (ICLR), 2014

  7. [7]

    Systematic Generalization: What Is Required and Can It Be Learned? InInternational Conference on Learning Representations, 2019

    Dzmitry Bahdanau, Shikhar Murty, Michael Noukhovitch, Thien Huu Nguyen, Harm de Vries, and Aaron Courville. Systematic Generalization: What Is Required and Can It Be Learned? InInternational Conference on Learning Representations, 2019

  8. [8]

    Bartlett, Michael I

    Peter L. Bartlett, Michael I. Jordan, and Jon D. McAuliffe. Convexity, Classification, and Risk Bounds.Journal of the American Statistical Association, 101(473):138–156, 2006

  9. [9]

    When can transformers reason with abstract symbols? InProceedings of the International Conference on Learning Representations (ICLR), 2024

    Enric Boix-Adserà, Omid Saremi, Emmanuel Abbe, Samy Bengio, Etai Littwin, and Joshua Susskind. When can transformers reason with abstract symbols? InProceedings of the International Conference on Learning Representations (ICLR), 2024

  10. [10]

    Optimal Rates for the Regularized Least-Squares Algorithm.Foundations of Computational Mathematics, 7(3):331–368, 2007

    Andrea Caponnetto and Ernesto De Vito. Optimal Rates for the Regularized Least-Squares Algorithm.Foundations of Computational Mathematics, 7(3):331–368, 2007

  11. [11]

    Toward Understanding In-Context vs

    Bryan Chan, Xinyi Chen, András György, and Dale Schuurmans. Toward Understanding In-Context vs. In-Weight Learning. InInternational Conference on Learning Representations, 2025

  12. [12]

    On Lazy Training in Differentiable Programming

    Lénaïc Chizat and Francis Bach. On Lazy Training in Differentiable Programming. InAdvances in Neural Information Processing Systems, volume 32, 2019

  13. [13]

    Frugal LMs Trained to Invoke Symbolic Solvers Achieve Parameter-Efficient Arithmetic Reasoning

    Subhabrata Dutta, Ishan Pandey, Joykirat Singh, Sunny Manchanda, Soumen Chakrabarti, and Tanmoy Chakraborty. Frugal LMs Trained to Invoke Symbolic Solvers Achieve Parameter-Efficient Arithmetic Reasoning. InProceedings of the AAAI Conference on Artificial Intelligence, volume 38, number 16, pages 17951–17959, 2024

  14. [14]

    Hwang, Soumya Sanyal, Xiang Ren, Allyson Ettinger, Zaid Harchaoui, and Yejin Choi

    Nouha Dziri, Ximing Lu, Melanie Sclar, Xiang Lorraine Li, Liwei Jiang, Bill Yuchen Lin, Sean Welleck, Peter West, Chandra Bhagavatula, Ronan Le Bras, Jena D. Hwang, Soumya Sanyal, Xiang Ren, Allyson Ettinger, Zaid Harchaoui, and Yejin Choi. Faith and Fate: Limits of Transformers on Compositionality. InAdvances in Neural Information Processing Systems, vol...

  15. [15]

    Efficient Tool Use with Chain-of-Abstraction Reasoning

    Silin Gao, Jane Dwivedi-Yu, Ping Yu, Xiaoqing Ellen Tan, Ramakanth Pasunuru, Olga Golovneva, Koustuv Sinha, Asli Celikyilmaz, Antoine Bosselut, and Tianlu Wang. Efficient Tool Use with Chain-of-Abstraction Reasoning. In Proceedings of the 31st International Conference on Computational Linguistics, pages 2727–2743, 2025. 18 WENJIE GUAN AND JELENA BRADIC

  16. [16]

    AbstRaL: Augmenting LLMs’ Reasoning by Reinforcing Abstract Thinking

    Silin Gao, Antoine Bosselut, Samy Bengio, and Emmanuel Abbe. AbstRaL: Augmenting LLMs’ Reasoning by Reinforcing Abstract Thinking. InProceedings of the International Conference on Learning Representations (ICLR), 2026

  17. [17]

    On the Compositional Generalization Gap of In-Context Learning

    Arian Hosseini, Ankit Vani, Dzmitry Bahdanau, Alessandro Sordoni, and Aaron Courville. On the Compositional Generalization Gap of In-Context Learning. InProceedings of the Fifth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, pages 272–280, Association for Computational Linguistics, Abu Dhabi, United Arab Emirates (Hybrid), 2022

  18. [18]

    A Taxonomy and Review of Generalization Research in NLP.Nature Machine Intelligence, 5:1161–1174, 2023

    Dieuwke Hupkes, Mario Giulianelli, Verna Dankers, Mikel Artetxe, Yanai Elazar, Tiago Pimentel, Christos Christodoulopoulos, Karim Lasri, Koustuv Sinha, Leila Khalatbari, Maria Ryskina, Rita Frieske, Ryan Cotterell, Zhijing Jin, and others. A Taxonomy and Review of Generalization Research in NLP.Nature Machine Intelligence, 5:1161–1174, 2023

  19. [19]

    Interchangeable Token Embeddings for Extendable Vocabulary and Alpha-Equivalence

    İlker Işık, Ramazan Gokberk Cinbis, and Ebru Aydin Gol. Interchangeable Token Embeddings for Extendable Vocabulary and Alpha-Equivalence. InProceedings of the 42nd International Conference on Machine Learning, Proceedings of Machine Learning Research, volume 267, pages 26523–26541, PMLR, 2025

  20. [20]

    Names Don’t Matter: Symbol-Invariant Transformer for Open-Vocabulary Learning

    İlker Işık and Wenchao Li. Names Don’t Matter: Symbol-Invariant Transformer for Open-Vocabulary Learning. arXiv preprint arXiv:2601.23169, 2026

  21. [21]

    Neural Tangent Kernel: Convergence and Generalization in Neural Networks

    Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural Tangent Kernel: Convergence and Generalization in Neural Networks. InAdvances in Neural Information Processing Systems, volume 31, pages 8580–8589, 2018

  22. [22]

    Su, Camillo Jose Taylor, and Dan Roth

    Bowen Jiang, Yangxinyu Xie, Zhuoqun Hao, Xiaomeng Wang, Tanwi Mallick, Weijie J. Su, Camillo Jose Taylor, and Dan Roth. A peek into token bias: Large language models are not yet genuine reasoners. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024

  23. [23]

    Measuring Compositional Generalization: A Comprehensive Method on Realistic Data

    Daniel Keysers, Nathanael Schärli, Nathan Scales, Hylke Buisman, Daniel Furrer, Sergii Kashubin, Nikola Momchev, Danila Sinopalnikov, Lukasz Stafiniak, Tibor Tihon, Dmitry Tsarkov, Xiao Wang, Marc van Zee, and Olivier Bousquet. Measuring Compositional Generalization: A Comprehensive Method on Realistic Data. InInternational Conference on Learning Represen...

  24. [24]

    COGS: A Compositional Generalization Challenge Based on Semantic Interpretation

    Najoung Kim and Tal Linzen. COGS: A Compositional Generalization Challenge Based on Semantic Interpretation. InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9087–9105, Association for Computational Linguistics, Online, 2020

  25. [25]

    To See the Unseen: on the Generalization Ability of Transformers in Symbolic Reasoning

    Nevena Lazić, Liam Fowl, András György, and Csaba Szepesvári. To See the Unseen: On the Generalization Ability of Transformers in Symbolic Reasoning. arXiv preprint arXiv:2604.21632, 2026

  26. [26]

    Ridgeless

    Tengyuan Liang and Alexander Rakhlin. Just Interpolate: Kernel “Ridgeless” Regression Can Generalize.The Annals of Statistics, 48(3):1329–1347, 2020

  27. [27]

    When Does Compositional Structure Yield Compositional Generalization? A Kernel Theory

    Samuel Lippl and Kim Stachenfeld. When Does Compositional Structure Yield Compositional Generalization? A Kernel Theory. InInternational Conference on Learning Representations, 2025

  28. [28]

    GSM- Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models

    Iman Mirzadeh, Keivan Alizadeh, Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, and Mehrdad Farajtabar. GSM- Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models. InProceedings of the International Conference on Learning Representations (ICLR), 2025

  29. [29]

    Universality of Kernel Random Matrices and Kernel Regression in the Quadratic Regime.Journal of Machine Learning Research, 26:1–73, 2025

    Parthe Pandit, Zhichao Wang, and Yizhe Zhu. Universality of Kernel Random Matrices and Kernel Regression in the Quadratic Regime.Journal of Machine Learning Research, 26:1–73, 2025

  30. [30]

    ICLR: In-Context Learning of Representations

    Core Francisco Park, Andrew Lee, Ekdeep Singh Lubana, Yongyi Yang, Maya Okawa, Kento Nishi, Martin Wattenberg, and Hidenori Tanaka. ICLR: In-Context Learning of Representations. InInternational Conference on Learning Representations, 2025

  31. [31]

    Generalization Properties of Learning with Random Features

    Alessandro Rudi and Lorenzo Rosasco. Generalization Properties of Learning with Random Features. InAdvances in Neural Information Processing Systems, volume 30, 2017

  32. [32]

    Laura Ruis, Jacob Andreas, Marco Baroni, Diane Bouchacourt, and Brenden M. Lake. A Benchmark for Systematic Generalization in Grounded Language Understanding. InAdvances in Neural Information Processing Systems, volume 33, pages 19861–19872, 2020

  33. [33]

    Bernhard Schölkopf, Ralf Herbrich, and Alexander J. Smola. A Generalized Representer Theorem. InComputational Learning Theory, Lecture Notes in Computer Science, volume 2111, pages 416–426, Springer, 2001

  34. [34]

    Mechanisms of Symbol Processing for In-Context Learning in Transformer Networks.Journal of Artificial Intelligence Research, 84(23), 2025

    Paul Smolensky, Roland Fernandez, Zhenghao Herbert Zhou, Mattia Opper, Adam Davies, and Jianfeng Gao. Mechanisms of Symbol Processing for In-Context Learning in Transformer Networks.Journal of Artificial Intelligence Research, 84(23), 2025

  35. [35]

    Springer, New York, 2008

    Ingo Steinwart and Andreas Christmann.Support Vector Machines. Springer, New York, 2008

  36. [36]

    Schema-Learning and Rebinding as Mechanisms of In-Context Learning and Emergence

    Sivaramakrishnan Swaminathan, Antoine Dedieu, Rajkumar Vasudeva Raju, Murray Shanahan, Miguel Lázaro- Gredilla, and Dileep George. Schema-Learning and Rebinding as Mechanisms of In-Context Learning and Emergence. InAdvances in Neural Information Processing Systems, volume 36, pages 28785–28804, 2023

  37. [37]

    In-Context Algebra

    Eric Todd, Jannik Brinkmann, Rohit Gandikota, and David Bau. In-Context Algebra. arXiv preprint arXiv:2512.16902, 2025. WHEN SYMBOL NAMES SHOULD NOT MATTER 19

  38. [38]

    Gomez, Łukasz Kaiser, and Illia Polosukhin

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. InAdvances in Neural Information Processing Systems, 2017

  39. [39]

    Johannes von Oswald, Eyvind Niklasson, Ettore Randazzo, João Sacramento, Alexander Mordvintsev, Andrey Zhmoginov, and Max Vladymyrov. Transformers Learn In-Context by Gradient Descent. In Proceedings of the 40th International Conference on Machine Learning, Proceedings of Machine Learning Research, volume 202, pages 35151–35174, PMLR, 2023

  40. [40]

    Taylor Whittington Webb, Ishan Sinha, and Jonathan D. Cohen. Emergent Symbols through Binding in External Memory. In International Conference on Learning Representations, 2021

  41. [41]

    Hongwei Wen, Annika Betken, and Hanyuan Hang. Optimal Learning of Kernel Logistic Regression for Complex Classification Scenarios. In International Conference on Learning Representations, 2025

  42. [42]

    Yiwei Wu, Atticus Geiger, and Raphaël Millière. How Do Transformers Learn Variable Binding in Symbolic Programs? arXiv preprint arXiv:2505.20896, 2025

  43. [43]

    Sang Michael Xie, Aditi Raghunathan, Percy Liang, and Tengyu Ma. An Explanation of In-Context Learning as Implicit Bayesian Inference. In International Conference on Learning Representations, 2022

  44. [44]

    Yukang Yang, Declan Iain Campbell, Kaixuan Huang, Mengdi Wang, Jonathan D. Cohen, and Taylor Whittington Webb. Emergent Symbolic Mechanisms Support Abstract Reasoning in Large Language Models. In Proceedings of the 42nd International Conference on Machine Learning, Proceedings of Machine Learning Research, volume 267, pages 70515–70549, PMLR, 2025

  45. [45]

    Dechen Zhang, Zhenmei Shi, Yi Zhang, Yingyu Liang, and Difan Zou. Kernel Regression in Structured Non-IID Settings: Theory and Implications for Denoising Score Learning. In Advances in Neural Information Processing Systems, 2025

  46. [46]

    Haobo Zhang, Yicheng Li, Weihao Lu, and Qian Lin. Optimal Rates of Kernel Ridge Regression under Source Condition in Large Dimensions. Journal of Machine Learning Research, 26:1–63, 2025

  47. [47]

    Tong Zhang. Statistical Behavior and Consistency of Classification Methods Based on Convex Risk Minimization. The Annals of Statistics, 32(1):56–85, 2004

  48. [48]

    Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. Neural Module Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 39–48, 2016

  49. [49]

    David G. T. Barrett, Felix Hill, Adam Santoro, Ari S. Morcos, and Timothy Lillicrap. Measuring Abstract Reasoning in Neural Networks. In Proceedings of the 35th International Conference on Machine Learning, Proceedings of Machine Learning Research, volume 80, pages 511–520, PMLR, 2018

  50. [50]

    Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, Oxford, 2013

  51. [51]

    F. R. K. Chung, R. L. Graham, and R. M. Wilson. Quasi-Random Graphs. Combinatorica, 9(4):345–362, 1989

  52. [52]

    Taco Cohen and Max Welling. Group Equivariant Convolutional Networks. In Proceedings of the 33rd International Conference on Machine Learning, Proceedings of Machine Learning Research, volume 48, pages 2990–2999, PMLR, 2016

  53. [53]

    Jerry A. Fodor and Zenon W. Pylyshyn. Connectionism and Cognitive Architecture: A Critical Analysis. Cognition, 28(1–2):3–71, 1988

  54. [54]

    Robert Gens and Pedro Domingos. Deep Symmetry Networks. In Advances in Neural Information Processing Systems, volume 27, 2014

  55. [55]

    Bernard Haasdonk and Hans Burkhardt. Invariant Kernel Functions for Pattern Analysis and Machine Learning. Machine Learning, 68(1):35–61, 2007

  56. [56]

    David Haussler. Convolution Kernels on Discrete Structures. Technical Report UCSC-CRL-99-10, University of California, Santa Cruz, 1999

  57. [57]

    Junkyung Kim, Matthew Ricci, and Thomas Serre. Not-So-CLEVR: Learning Same–Different Relations Strains Feedforward Neural Networks. Interface Focus, 8(4):20180011, 2018

  58. [58]

    Brenden M. Lake and Marco Baroni. Generalization without Systematicity: On the Compositional Skills of Sequence-to-Sequence Recurrent Networks. In Proceedings of the 35th International Conference on Machine Learning, Proceedings of Machine Learning Research, volume 80, pages 2873–2882, PMLR, 2018

  59. [59]

    Qian Liu, Bo An, Jian-Guang Lou, Bei Chen, Zhouhan Lin, Yan Gao, Bin Zhou, and Dongmei Zhang. Compositional Generalization by Learning Analytical Expressions. In Advances in Neural Information Processing Systems, volume 33, pages 11416–11427, 2020

  60. [60]

    Huma Lodhi, Craig Saunders, John Shawe-Taylor, Nello Cristianini, and Chris Watkins. Text Classification Using String Kernels. Journal of Machine Learning Research, 2:419–444, 2002

  61. [61]

    Kate McCurdy, Paul Soulos, Henry Conklin, Mattia Opper, Paul Smolensky, Jianfeng Gao, and Roland Fernandez. Toward Compositional Behavior in Neural Models: A Survey of Current Views. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 9323–9339, Association for Computational Linguistics, 2024

  62. [62]

    Youssef Mroueh, Stephen Voinea, and Tomaso Poggio. Learning with Group Invariant Features: A Kernel Perspective. In Advances in Neural Information Processing Systems, volume 28, 2015

  63. [63]

    Maxwell Nye, Armando Solar-Lezama, Joshua B. Tenenbaum, and Brenden M. Lake. Learning Compositional Rules via Neural Program Synthesis. In Advances in Neural Information Processing Systems, volume 33, pages 10832–10842, 2020

  64. [64]

    Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Scott Johnston, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, a...

  65. [65]

    Adam Santoro, David Raposo, David G. T. Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, and Timothy Lillicrap. A Simple Neural Network Module for Relational Reasoning. In Advances in Neural Information Processing Systems, volume 30, 2017

  66. [66]

    Joel A. Tropp. An Introduction to Matrix Concentration Inequalities. Foundations and Trends in Machine Learning, 8(1–2):1–230, 2015

  67. [67]

    Gail Weiss, Yoav Goldberg, and Eran Yahav. Thinking Like Transformers. In Proceedings of the 38th International Conference on Machine Learning, Proceedings of Machine Learning Research, volume 139, pages 11080–11090, PMLR, 2021

  68. [68]

    Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Ruslan Salakhutdinov, and Alexander Smola. Deep Sets. In Advances in Neural Information Processing Systems, volume 30, 2017