pith. machine review for the scientific record.

arxiv: 2604.21632 · v1 · submitted 2026-04-23 · 💻 cs.AI

Recognition: unknown

To See the Unseen: on the Generalization Ability of Transformers in Symbolic Reasoning

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 22:14 UTC · model grok-4.3

classification 💻 cs.AI
keywords: transformers · generalization · symbolic reasoning · propositional logic · unseen tokens · embedding collapse · in-context learning · decoder-only models

The pith

Unembeddings of unseen tokens collapse to nearly identical vectors during training, blocking transformers from distinguishing new variables in logic tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Decoder-only transformers solve in-context propositional logic problems but fail when problems use variable names absent from training. The paper demonstrates that a key cause is the collapse of unembeddings for those unseen tokens: they converge to nearly the same vector, making multiple new variables hard to distinguish. The effect is strongest when embedding and unembedding weights are shared. Combining a minor architecture change that aids token copying, greater training-data variety, and periodic resets or freezing of embeddings enables the models to generalize to unseen tokens. The same collapse pattern appears in real models such as the Gemma 3 family, where it yields poor initializations for fine-tuning on reserved tokens.
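
To make the task concrete, here is a minimal sketch of the forward-chaining procedure the training traces execute (per Figure 1): rules are definite clauses, facts are true zero-order predicates, and derivation runs until the query is proven or a fixed point is reached. The encoding below is illustrative, not the paper's data format.

    def forward_chain(rules, facts, query):
        """rules: list of (premises, conclusion) pairs; facts: set of true predicates."""
        proven = set(facts)
        changed = True
        while changed and query not in proven:
            changed = False
            for premises, conclusion in rules:
                if conclusion not in proven and all(p in proven for p in premises):
                    proven.add(conclusion)  # derive a new fact from the rule
                    changed = True
        return query in proven

    # The example from the paper's trace format: "If A and B then C",
    # "If B then A", fact B, query C.
    rules = [({"A", "B"}, "C"), ({"B"}, "A")]
    print(forward_chain(rules, {"B"}, "C"))  # True: B yields A, then A and B yield C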

Core claim

The unembeddings (last-layer weights) of unseen tokens collapse to nearly the same vector during training. This collapse makes it difficult for the model to distinguish multiple unseen variables, especially when the embedding and unembedding parameters are shared, and it supplies a mechanistic account of why existing heuristic interventions like active forgetting succeed.
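
The claim implies a simple diagnostic, sketched here in NumPy under the assumption that the unembedding is a plain vocabulary-by-dimension matrix; W_unembed and unseen_ids are placeholder names, not the paper's code. A mean pairwise cosine similarity near 1.0 among the unseen-token rows is the collapse signature (the quantity Figure 2 reports).

    import numpy as np

    def mean_pairwise_cosine(W_unembed, unseen_ids):
        """Mean cosine similarity among the unembedding rows of unseen tokens."""
        V = W_unembed[list(unseen_ids)]
        V = V / np.linalg.norm(V, axis=1, keepdims=True)   # unit-normalize rows
        sims = V @ V.T                                     # all pairwise cosines
        mask = ~np.eye(len(V), dtype=bool)                 # drop self-similarities
        return float(sims[mask].mean())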

What carries the argument

Unembedding collapse: the convergence of last-layer weights for unseen tokens to a shared vector, which prevents the model from distinguishing variables in symbolic tasks.

If this is right

  • Periodic resetting of token embeddings mitigates collapse and restores the ability to handle unseen variables (a reset sketch follows this list).
  • A small architecture adjustment that improves copying, paired with data diversity and embedding resets or freezing, produces generalization to unseen tokens.
  • Unused tokens in models like the Gemma 3 family exhibit correlated embeddings that form poor starting points for downstream fine-tuning.
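
As referenced in the first bullet, here is a hedged sketch of the periodic-reset intervention ("active forgetting" in the abstract), assuming a PyTorch training loop; the reset interval and init scale are illustrative choices, not values from the paper.

    import torch

    def maybe_reset_embeddings(embed: torch.nn.Embedding, step: int,
                               reset_every: int = 1000, std: float = 0.02):
        """Periodically re-initialize embedding weights during training."""
        if step == 0 or step % reset_every != 0:
            return
        with torch.no_grad():
            # With tied weights, this resets the unembedding as well; optimizer
            # state (e.g., Adam moments) for these rows may also need clearing.
            torch.nn.init.normal_(embed.weight, mean=0.0, std=std)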

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar unembedding collapse may limit generalization in other abstract or symbolic reasoning settings beyond propositional logic.
  • Embedding resets or freezing could serve as a general tactic to boost out-of-distribution performance in language models handling novel symbols (a row-freezing sketch follows this list).
  • Tracking similarity among unembedding vectors during training might act as an early warning for impending failures on unseen inputs.
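
One way to realize the freezing tactic from the second bullet, sketched in PyTorch: mask the gradient rows for reserved tokens so they keep their distinct initial vectors while the rest of the model trains. reserved_ids is hypothetical, and this is an assumption about mechanism, not the paper's implementation.

    import torch

    def freeze_embedding_rows(embed: torch.nn.Embedding, reserved_ids):
        """Zero the gradients of selected rows so they stay at initialization."""
        mask = torch.ones(embed.num_embeddings, 1)
        mask[list(reserved_ids)] = 0.0
        embed.weight.register_hook(lambda grad: grad * mask)
        # Note: decoupled weight decay (AdamW) would still shrink these rows;
        # exclude the embedding from weight decay if exact freezing matters.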

Load-bearing premise

That the observed collapse of unembeddings is the primary driver of generalization failure rather than a side effect of other training dynamics.

What would settle it

Train a model in which unembeddings for unseen tokens remain distinct yet the model still cannot generalize to new variables in propositional logic problems.
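
A skeleton of that experiment, assuming untied embeddings: pin the unseen-token unembedding rows to mutually distinct random directions before training, hold them fixed (e.g., with a gradient mask as above), and test whether failure on fresh variables persists. All names here are illustrative.

    import torch

    def pin_distinct_unembeddings(unembed: torch.nn.Linear, unseen_ids, seed=0):
        """Fix unseen-token unembedding rows to distinct random directions."""
        g = torch.Generator().manual_seed(seed)
        with torch.no_grad():
            rows = torch.randn(len(unseen_ids), unembed.in_features, generator=g)
            unembed.weight[list(unseen_ids)] = rows  # near-orthogonal in high dim
        # Train with these rows frozen; if accuracy on problems using the pinned
        # variables still fails to rise, collapse was not the primary driver.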

Figures

Figures reproduced from arXiv:2604.21632 by András György, Csaba Szepesvári, Liam Fowl, Nevena Lazić.

Figure 1. Left: An example of a propositional logic problem with reasoning used for training. Given a set of rules (definite clauses), a set of facts (true zero-order predicates), and a query predicate, the goal is to compute the truth value of the query. The reasoning trace (shown above) executes the forward-chaining algorithm, and stops if the query is proven true. Right: Two types of test examples. At test time, …
Figure 2. Mean pairwise cosine similarities of the unembeddings for different hyperparameters.
Figure 3. Propositional logic evaluation accuracy for models trained on single-token symbols.
Figure 4. Cosine similarities between the embeddings of the 99 unused tokens and 100 …
Figure 5. Accuracy when finetuning Gemma 3 1B IT on logic problems with predicates …
Figure 6. Special tokens in the vocabulary in addition to zeroth-order predicates.
Figure 7. Formatting example for a propositional logic problem with reasoning used for …
Figure 8. Propositional logic evaluation accuracy for models trained on single-token symbols …
Figure 9. Examples of reasoning traces in the case of a single single-token symbolic variable.
Figure 10. Evaluation loss on the C4 dataset with trainable and frozen embeddings and …
Figure 11. Stylized example where the multi-token query shares a prefix with another …
Figure 12. Evaluation of models trained on symbols of up to …
Figure 13. Difference between the sorted singular values of the Frobenius-normalized …
read the original abstract

We investigate the ability of decoder-only transformer models to perform abstract symbolic reasoning; specifically solving propositional logic reasoning problems given in-context. Previous work demonstrated that models fail to generalize to problems involving variable names that were not observed during training, and it was shown that one reason behind this is the difficulty of copying (or generating) unseen tokens. We show both theoretically and empirically that a particular representational collapse also has a crucial role: the unembeddings (last-layer weights) of unseen tokens collapse to nearly the same vector during training. The collapse makes distinguishing multiple unseen variables difficult for the model (especially when the embedding and unembedding parameters are shared), and provides a mechanistic explanation for the effectiveness of existing heuristic interventions like "active forgetting", which periodically reset the token (un)embeddings. Based on these observations, we devise a combination of techniques, involving a small architecture change facilitating copying, data diversity, and freezing or resetting (un)embeddings, that achieves generalization to unseen tokens. We support our claims with extensive controlled experiments on propositional logic reasoning problems. Beyond synthetic experiments, we also observe evidence of (un)embedding collapse in the open-weight models in the Gemma 3 family, which includes 99 unused tokens reserved for downstream use. Empirically we find that the correlated embeddings of these tokens are a poor initialization for finetuning applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper claims that decoder-only transformers fail to generalize to unseen variable names in in-context propositional logic reasoning tasks, due in part to a representational collapse in which the unembeddings (last-layer weights) of unseen tokens converge to nearly identical vectors during training. This collapse compounds the known copying difficulty for unseen tokens (especially under shared embedding/unembedding parameters) and provides a mechanistic account for the success of interventions like active forgetting. The authors derive the collapse theoretically, validate it empirically on synthetic propositional logic problems, and demonstrate that a combination of a small architecture change to facilitate copying, increased data diversity, and freezing or resetting of (un)embeddings restores generalization. They further report evidence of similar collapse in the Gemma 3 family and note its implications for finetuning.

Significance. If the central claim holds, the work supplies a concrete mechanistic explanation for a previously observed generalization failure in symbolic reasoning, distinct from but interacting with the copying issue. The controlled synthetic experiments, successful intervention combination, and out-of-distribution check on Gemma 3 strengthen the contribution and could guide improved training regimes for abstract reasoning and OOV handling in transformers.

minor comments (3)
  1. [Abstract] The abstract states that the proposed interventions 'achieve generalization to unseen tokens' but does not quantify the improvement (e.g., accuracy deltas or success rates) relative to baselines; adding a brief numerical summary would help readers assess the practical impact.
  2. [Abstract] The description of the 'small architecture change facilitating copying' is not elaborated in the provided abstract or summary; a one-sentence specification of the modification (e.g., which layer or attention mechanism is altered) would improve accessibility.
  3. [Experiments] In the empirical sections, confirm that all reported results include the number of random seeds or runs and any statistical tests, particularly for the ablation studies comparing the combined interventions against individual components.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of our manuscript and the recommendation for minor revision. The provided summary accurately captures the central claims, including the role of unembedding collapse in limiting generalization to unseen tokens, its interaction with copying difficulties, and the effectiveness of the proposed interventions. We appreciate the recognition of the controlled experiments, theoretical derivation, and the out-of-distribution observations on Gemma 3.

Circularity Check

0 steps flagged

No significant circularity; the derivation relies on independent empirical measurements and theoretical analysis.

full rationale

The paper's core argument—that unembedding collapse of unseen tokens contributes to generalization failure—is grounded in direct empirical observations of vector collapse during training, controlled ablation experiments, and a theoretical analysis of training dynamics. These elements do not reduce to fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations whose validity depends on the current work. Prior work on copying difficulties is cited as background but is not invoked to justify the collapse mechanism or the proposed interventions. The Gemma-3 observation functions as an external check rather than an internal bootstrap. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The analysis assumes standard decoder-only transformer training dynamics and shared or separate embedding/unembedding matrices; no new entities are postulated and no free parameters are fitted to produce the collapse result itself.

axioms (2)
  • domain assumption Decoder-only transformers are trained with standard next-token prediction on sequences containing propositional logic expressions.
    Invoked when describing the training setup and the emergence of collapse.
  • domain assumption Embedding and unembedding matrices may be tied or untied.
    Used to explain why collapse is especially problematic under parameter sharing.

pith-pipeline@v0.9.0 · 5559 in / 1341 out tokens · 39241 ms · 2026-05-09T22:14:27.906342+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. When Symbol Names Should Not Matter: A Logistic Theory of Fresh-Symbol Classification

    cs.LG · 2026-05 · unverdicted · novelty 7.0

    Regularized kernel logistic classifiers decompose into an ideal template classifier plus a perturbation from token overlaps modeled by a colored collision graph, yielding high-probability margin-transfer guarantees fo...

Reference graph

Works this paper leans on

22 extracted references · 13 canonical work pages · cited by 1 Pith paper · 5 internal anchors

  1. [1]

    GPT-4 Technical Report

     Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

  2. [2]

    PaLM 2 Technical Report

     Rohan Anil, Andrew M. Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. PaLM 2 technical report. arXiv preprint arXiv:2305.10403, 2023.

  3. [3]

    Layer Normalization

     Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.

  4. [4]

     Data Distributional Properties Drive Emergent In-Context Learning in Transformers

     Stephanie C. Y. Chan, Adam Santoro, Andrew Lampinen, Jane Wang, Aaditya Singh, Pierre Richemond, James McClelland, and Felix Hill. Data distributional properties drive emergent in-context learning in transformers. Advances in Neural Information Processing Systems, 35:18878–18891, 2022. Stephanie C. Y. Chan, Ishita Dasgupta, Junkyung Kim, Dharshan Kumaran, Andrew K. Lampinen, and Felix Hill. Transformers generalize differently from information stored in context vs. in weights. arXiv preprint arXiv:2210.05675, 2022.

  5. [5]

    Gemma 3 Technical Report

     Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, et al. Gemma 3 technical report. arXiv preprint arXiv:2503.19786, 2025.

  6. [6]

    Gaussian Error Linear Units (GELUs)

     Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415, 2016.

  7. [7]

     A Peek into Token Bias: Large Language Models Are Not Yet Genuine Reasoners

     Bowen Jiang, Yangxinyu Xie, Zhuoqun Hao, Xiaomeng Wang, Tanwi Mallick, Weijie J. Su, Camillo J. Taylor, and Dan Roth. A peek into token bias: Large language models are not yet genuine reasoners. arXiv preprint arXiv:2406.11050, 2024.

  8. [8]

     NanoDO: A minimal transformer decoder-only language model implementation

     Peter J. Liu, Roman Novak, Jaehoon Lee, Mitchell Wortsman, Lechao Xiao, Katie Everett, Alexander A. Alemi, Mark Kurzeja, Pierre Marcenac, Izzeddin Gur, Simon Kornblith, Kelvin Xu, Gamaleldin Elsayed, Ian Fischer, Jeffrey Pennington, Ben Adlam, and Jascha Sohl-Dickstein. NanoDO: A minimal transformer decoder-only language model implementation. URL http://github.com/google-deepmind/nanodo. Accessed 2025-12-14.

  9. [9]

    Decoupled Weight Decay Regularization

     Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.

  10. [10]

     Frontier LLMs Still Struggle with Simple Reasoning Tasks

     Alan Malek, Jiawei Ge, Nevena Lazić, Chi Jin, András György, and Csaba Szepesvári. Frontier LLMs still struggle with simple reasoning tasks. arXiv preprint arXiv:2507.07313, 2025.

  11. [11]

     Embers of Autoregression Show How Large Language Models Are Shaped by the Problem They Are Trained to Solve

     R. Thomas McCoy, Shunyu Yao, Dan Friedman, Mathew D. Hardy, and Thomas L. Griffiths. Embers of autoregression show how large language models are shaped by the problem they are trained to solve. Proceedings of the National Academy of Sciences, 121(41):e2322420121, 2024. R. Thomas McCoy, Shunyu Yao, Dan Friedman, Mathew D. Hardy, and Thomas L. Griffiths. Whe…

  12. [12]

     The Mechanistic Basis of Data Dependence and Abrupt Learning in an In-Context Classification Task

     Gautam Reddy. The mechanistic basis of data dependence and abrupt learning in an in-context classification task. arXiv preprint arXiv:2312.03002, 2023.

  13. [13]

     Randomized Positional Encodings Boost Length Generalization of Transformers

     Anian Ruoss, Grégoire Delétang, Tim Genewein, Jordi Grau-Moya, Róbert Csordás, Mehdi Bennani, Shane Legg, and Joel Veness. Randomized positional encodings boost length generalization of transformers. arXiv preprint arXiv:2305.16843, 2023.

  14. [14]

     Positional Description Matters for Transformers Arithmetic

     Ruoqi Shen, Sébastien Bubeck, Ronen Eldan, Yin Tat Lee, Yuanzhi Li, and Yi Zhang. Positional description matters for transformers arithmetic. arXiv preprint arXiv:2311.14737, 2023. Yucheng Sun, Alessandro Stolfo, and Mrinmaya Sachan. …

  15. [15]

     A. Proof of Lemma 4.1

     Proof. Let $p^t_{i,n} = p(i \mid x_n, W^{(t)}, \theta^{(t)})$. The weights satisfy
     $W^{(t+1)}_j - W^{(t+1)}_i = (1 - \lambda\eta_t)\bigl(W^{(t)}_j - W^{(t)}_i\bigr) - \eta_t \frac{1}{N} \sum_{n=1}^{N} \bigl(p^t_{j,n} - p^t_{i,n}\bigr)\,\phi^t_n \quad (2)$
     By the Lagrange mean-value theorem, $\exp(b) - \exp(a) = \exp(c)(b - a)$ for some $c$ such that $a < c < b$. Assume without loss of generality that $\phi^{t\top}_n W^{(t)}_j \ge \phi^{t\top}_n W^{(t)}_i$ …

  16. [16]

    We use RoPE positional embeddings (Su et al., 2024)

     We use MLPs with hidden size 1024 and GELU activations (Hendrycks & Gimpel, 2016). We use RoPE positional embeddings (Su et al., 2024). We train using the AdamW optimizer (Loshchilov et al., …)

  17. [17]

    We clip gradients by global norm

    with decay 0.001, batch size 256, peak learning rate 0.0001, warmup and final rate of 0.00001, and 2000 warmup steps. We clip gradients by global norm

  18. [18]

    For each predicate in each example, we sample its length i∈[n] and tokens uniformly at random, while ensuring there are no duplicates

     For the multi-token symbol experiments, the vocabulary includes 26 baseline tokens, and sequences of these tokens form predicates. For each predicate in each example, we sample its length i∈[n] and tokens uniformly at random, while ensuring there are no duplicates. Special tokens: Facts, EndFacts, Rules, EndRules, Query, EndQuery, Answer, Reasoning, EndReasoning, Newfact, EndNewfac…

  19. [19]

     We generated reasoning traces by running forward chaining

     We generated reasoning traces of the form in Figure 7 by running forward chaining until either the query predicate was proven true or no further predicates could be proven true, and generating text corresponding to the algorithm trace. <BOS> Rules: If A and B then C EndIf If B then A EndIf Facts: B EndFacts Query: C ? Reasoning: Fac…

  20. [20]

    Active forgetting throughout the training can generalize with the vanilla architecture, provided sufficiently high symbolic diversity

     We observe that temporary active forgetting does not lead to the desired inductive bias (symbolic reasoning) unless combined with copy attention. Active forgetting throughout training can generalize with the vanilla architecture, provided sufficiently high symbolic diversity. Rules: If enchanting then UNK EndIf If perfect t…

  21. [21]

     'prefix' in the figure title means that the query shares a prefix with another predicate, making them more easily confusable. While our work focuses primarily on unseen tokens, here we also conduct a small preliminary study on unseen multi-token variables (where all tokens are seen, but not all combinations are seen). This case has not been studied systematic…

  22. [22]

    By inspecting the samples containing errors, we observe the following qualitative behaviors

     While models generalize well to problems involving test symbols of lengths seen during training, performance gets worse for symbols that are longer by even one token. By inspecting the samples containing errors, we observe the following qualitative behaviors. Models trained with n = 2 and n = 3 tend to truncate symbols to n tokens; these errors also occur f…