To See the Unseen: on the Generalization Ability of Transformers in Symbolic Reasoning
Pith reviewed 2026-05-09 22:14 UTC · model grok-4.3
The pith
Unembeddings of unseen tokens collapse to nearly identical vectors during training, blocking transformers from distinguishing new variables in logic tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The unembeddings (last-layer weights) of unseen tokens collapse to nearly the same vector during training. This collapse makes distinguishing multiple unseen variables difficult for the model, especially when the embedding and unembedding parameters are shared, and supplies a mechanistic account for why existing heuristic interventions like active forgetting succeed.
What carries the argument
Unembedding collapse: the convergence of last-layer weights for unseen tokens to a shared vector, which prevents the model from distinguishing variables in symbolic tasks.
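The collapse is cheap to quantify. A minimal numpy sketch (toy data, not the paper's models) that measures mean pairwise cosine similarity among a set of unembedding rows; values near 1.0 signal collapse:

```python
import numpy as np

def mean_pairwise_cosine(unembed: np.ndarray, token_ids: list[int]) -> float:
    """Average pairwise cosine similarity among the unembedding rows
    for the given token ids. Values near 1.0 indicate collapse."""
    vecs = unembed[token_ids]
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    sims = vecs @ vecs.T
    n = len(token_ids)
    return float(sims[~np.eye(n, dtype=bool)].mean())

# Toy illustration: "seen" rows are independent random vectors, while
# "unseen" rows cluster around one shared vector (the collapse the
# paper describes).
rng = np.random.default_rng(0)
d = 64
seen = rng.normal(size=(10, d))
shared = rng.normal(size=d)
unseen = shared + 0.01 * rng.normal(size=(5, d))
W = np.vstack([seen, unseen])

sim_seen = mean_pairwise_cosine(W, list(range(10)))      # near 0: distinct
sim_unseen = mean_pairwise_cosine(W, list(range(10, 15)))  # near 1: collapsed
```

The same statistic, computed over the held-out token ids at checkpoints, is what the "early warning" reading below would track.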
If this is right
- Periodic resetting of token embeddings mitigates collapse and restores the ability to handle unseen variables.
- A small architecture adjustment that improves copying, paired with data diversity and embedding resets or freezing, produces generalization to unseen tokens.
- Unused tokens in models like the Gemma 3 family exhibit correlated embeddings that form poor starting points for downstream fine-tuning.
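Active forgetting, as described above, periodically re-initializes the token (un)embeddings while the rest of the network keeps training. A minimal sketch under assumed details (the `reset_every` schedule and init scale are illustrative, not the paper's values):

```python
import numpy as np

def maybe_reset(step: int, embed: np.ndarray, reset_every: int,
                rng: np.random.Generator) -> np.ndarray:
    """Active forgetting: every `reset_every` steps, re-draw the token
    (un)embedding matrix from its initializer, discarding whatever
    shared direction the rows may have drifted onto. All other
    parameters keep training normally."""
    if step > 0 and step % reset_every == 0:
        return rng.normal(scale=0.02, size=embed.shape)
    return embed

rng = np.random.default_rng(0)
embed = rng.normal(scale=0.02, size=(100, 64))
embed_kept = maybe_reset(999, embed, reset_every=1000, rng=rng)   # no-op
embed_reset = maybe_reset(1000, embed, reset_every=1000, rng=rng)  # re-drawn
```

The mechanistic account above explains why this works: resetting destroys the collapsed shared vector before it can dominate the unseen-token rows.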
Where Pith is reading between the lines
- Similar unembedding collapse may limit generalization in other abstract or symbolic reasoning settings beyond propositional logic.
- Embedding resets or freezing could serve as a general tactic to boost out-of-distribution performance in language models handling novel symbols.
- Tracking similarity among unembedding vectors during training might act as an early warning for impending failures on unseen inputs.
Load-bearing premise
That the observed collapse of unembeddings is the primary driver of generalization failure rather than a side effect of other training dynamics.
What would settle it
Train a model in which unembeddings for unseen tokens remain distinct yet the model still cannot generalize to new variables in propositional logic problems.
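One way to set up such a control is to pin the held-out unembedding rows at their distinct random initialization via a gradient mask. A plain-numpy sketch (the `frozen_ids` split and SGD update are illustrative assumptions):

```python
import numpy as np

def masked_sgd_step(unembed: np.ndarray, grad: np.ndarray,
                    lr: float, frozen_ids: np.ndarray) -> np.ndarray:
    """SGD step that zeroes the update for `frozen_ids`, keeping those
    unembedding rows fixed at their distinct random initialization.
    If generalization to new variables still fails with collapse ruled
    out this way, collapse is not the primary driver."""
    mask = np.ones((unembed.shape[0], 1))
    mask[frozen_ids] = 0.0
    return unembed - lr * mask * grad

rng = np.random.default_rng(0)
W = rng.normal(size=(20, 8))
g = rng.normal(size=(20, 8))
frozen = np.arange(15, 20)  # rows held out as "unseen" tokens
W_new = masked_sgd_step(W, g, lr=0.1, frozen_ids=frozen)
```

In a real framework this corresponds to masking the embedding gradient (or marking the rows non-trainable), not to a custom optimizer.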
read the original abstract
We investigate the ability of decoder-only transformer models to perform abstract symbolic reasoning; specifically solving propositional logic reasoning problems given in-context. Previous work demonstrated that models fail to generalize to problems involving variable names that were not observed during training, and it was shown that one reason behind this is the difficulty of copying (or generating) unseen tokens. We show both theoretically and empirically that a particular representational collapse also has a crucial role: the unembeddings (last-layer weights) of unseen tokens collapse to nearly the same vector during training. The collapse makes distinguishing multiple unseen variables difficult for the model (especially when the embedding and unembedding parameters are shared), and provides a mechanistic explanation for the effectiveness of existing heuristic interventions like "active forgetting", which periodically reset the token (un)embeddings. Based on these observations, we devise a combination of techniques, involving a small architecture change facilitating copying, data diversity, and freezing or resetting (un)embeddings, that achieves generalization to unseen tokens. We support our claims with extensive controlled experiments on propositional logic reasoning problems. Beyond synthetic experiments, we also observe evidence of (un)embedding collapse in the open-weight models in the Gemma 3 family, which includes 99 unused tokens reserved for downstream use. Empirically we find that the correlated embeddings of these tokens are a poor initialization for finetuning applications.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that decoder-only transformers fail to generalize to unseen variable names in in-context propositional logic reasoning tasks, due in part to a representational collapse in which the unembeddings (last-layer weights) of unseen tokens converge to nearly identical vectors during training. This collapse compounds the known copying difficulty for unseen tokens (especially under shared embedding/unembedding parameters) and provides a mechanistic account for the success of interventions like active forgetting. The authors derive the collapse theoretically, validate it empirically on synthetic propositional logic problems, and demonstrate that a combination of a small architecture change to facilitate copying, increased data diversity, and freezing or resetting of (un)embeddings restores generalization. They further report evidence of similar collapse in the Gemma 3 family and note its implications for finetuning.
Significance. If the central claim holds, the work supplies a concrete mechanistic explanation for a previously observed generalization failure in symbolic reasoning, distinct from but interacting with the copying issue. The controlled synthetic experiments, successful intervention combination, and out-of-distribution check on Gemma 3 strengthen the contribution and could guide improved training regimes for abstract reasoning and OOV handling in transformers.
minor comments (3)
- [Abstract] The abstract states that the proposed interventions 'achieve generalization to unseen tokens' but does not quantify the improvement (e.g., accuracy deltas or success rates) relative to baselines; adding a brief numerical summary would help readers assess the practical impact.
- [Abstract] The description of the 'small architecture change facilitating copying' is not elaborated in the provided abstract or summary; a one-sentence specification of the modification (e.g., which layer or attention mechanism is altered) would improve accessibility.
- [Experiments] In the empirical sections, confirm that all reported results include the number of random seeds or runs and any statistical tests, particularly for the ablation studies comparing the combined interventions against individual components.
Simulated Author's Rebuttal
We thank the referee for their positive assessment of our manuscript and the recommendation for minor revision. The provided summary accurately captures the central claims, including the role of unembedding collapse in limiting generalization to unseen tokens, its interaction with copying difficulties, and the effectiveness of the proposed interventions. We appreciate the recognition of the controlled experiments, theoretical derivation, and the out-of-distribution observations on Gemma 3.
Circularity Check
No significant circularity; derivation relies on independent empirical measurements and theoretical analysis
full rationale
The paper's core argument—that unembedding collapse of unseen tokens contributes to generalization failure—is grounded in direct empirical observations of vector collapse during training, controlled ablation experiments, and a theoretical analysis of training dynamics. These elements do not reduce to fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations whose validity depends on the current work. Prior work on copying difficulties is cited as background but is not invoked to justify the collapse mechanism or the proposed interventions. The Gemma-3 observation functions as an external check rather than an internal bootstrap. The derivation chain remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Decoder-only transformers are trained with standard next-token prediction on sequences containing propositional logic expressions.
- domain assumption Embedding and unembedding matrices may be tied or untied.
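The tied/untied distinction matters because tying makes input and output token representations share one matrix. A minimal sketch of the two readout conventions (toy shapes, not the paper's configuration):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_model = 50, 32
E = rng.normal(size=(vocab, d_model))  # token embedding matrix
h = rng.normal(size=d_model)           # final hidden state at one position

# Tied: the unembedding is E itself, so collapse among embedding rows
# is inherited directly by the output logits.
logits_tied = h @ E.T

# Untied: an independent unembedding matrix U; collapse can develop in
# either matrix separately.
U = rng.normal(size=(vocab, d_model))
logits_untied = h @ U.T
```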
Forward citations
Cited by 1 Pith paper
- When Symbol Names Should Not Matter: A Logistic Theory of Fresh-Symbol Classification
  Regularized kernel logistic classifiers decompose into an ideal template classifier plus a perturbation from token overlaps modeled by a colored collision graph, yielding high-probability margin-transfer guarantees fo...