
arxiv: 2510.26745 · v3 · pith:645CM2ASnew · submitted 2025-10-30 · 💻 cs.LG · cs.AI · cs.CL · stat.ML

Deep sequence models tend to memorize geometrically; it is unclear why

Pith reviewed 2026-05-21 20:33 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL · stat.ML
keywords geometric memory · sequence models · embeddings · spectral bias · Node2Vec · composition reasoning · Transformer memory · knowledge representation

The pith

Deep sequence models synthesize embeddings that encode global relationships between all entities, even without direct co-occurrence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies that deep sequence models store atomic facts not just as local associative lookups but as geometric memory in their embeddings. These embeddings capture novel relationships across the entire set of entities, allowing an ℓ-fold composition reasoning problem to be solved as a single navigation step. This geometric structure emerges counterintuitively, even when it is more complex than brute-force memorization of pairs. The authors trace it to a natural spectral bias, demonstrated through its similarity to Node2Vec, rather than standard training, architecture, or optimization forces.

Core claim

Deep sequence models synthesize embeddings encoding novel global relationships between all entities, including ones that do not co-occur in training. Such storage is powerful: for instance, it transforms a hard reasoning task involving an ℓ-fold composition into an easy-to-learn 1-step navigation task. The rise of such a geometry cannot be straightforwardly attributed to typical supervisory, architectural, or optimizational pressures. Instead, an analysis of a connection to Node2Vec indicates that the geometry stems from a spectral bias that arises naturally despite the lack of such pressures.

What carries the argument

Geometric memory: embeddings that form a structure encoding global relationships, reducing multi-step composition to single-step navigation.
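The contrast between the two kinds of memory can be sketched in a few lines. The following is a minimal illustration, not the paper's model: the helper names (`build_path_star`, `first_hop_associative`, `toy_geometric_embeddings`, `first_hop_geometric`) are hypothetical, and the embeddings are hand-built so that nodes on the same path share a direction, which is an assumption standing in for whatever a trained model actually learns. Under that assumption, finding the first hop from the root toward a leaf takes ℓ chained lookups with an edge table, but only one nearest-neighbour comparison with the embeddings.

```python
import numpy as np

def build_path_star(d, ell):
    """Path-star graph: root 0 with d arms of ell nodes each (hypothetical helper)."""
    parent, paths, node = {}, [], 1
    for _ in range(d):
        prev, path = 0, []
        for _ in range(ell):
            parent[node] = prev
            path.append(node)
            prev, node = node, node + 1
        paths.append(path)
    return parent, paths

def first_hop_associative(parent, leaf, root=0):
    """Associative memory: chain edge lookups from the leaf back to the root."""
    node, lookups = leaf, 0
    while True:
        p = parent[node]          # one atomic-fact lookup
        lookups += 1
        if p == root:
            return node, lookups  # node is the root's child on this path
        node = p

def toy_geometric_embeddings(paths, dim, noise, seed):
    """ASSUMPTION: hand-built embeddings, one shared direction per path plus noise."""
    rng = np.random.default_rng(seed)
    emb = {}
    for path in paths:
        direction = rng.normal(size=dim)
        direction /= np.linalg.norm(direction)
        for node in path:
            emb[node] = direction + noise * rng.normal(size=dim)
    return emb

def first_hop_geometric(emb, leaf, first_hops):
    """Geometric memory: pick the root's child most cosine-similar to the leaf."""
    cand = np.stack([emb[v] for v in first_hops])
    q = emb[leaf]
    sims = cand @ q / (np.linalg.norm(cand, axis=1) * np.linalg.norm(q))
    return first_hops[int(np.argmax(sims))]

d, ell = 5, 6
parent, paths = build_path_star(d, ell)
emb = toy_geometric_embeddings(paths, dim=64, noise=0.05, seed=0)
first_hops = [p[0] for p in paths]
for p in paths:
    hop_a, lookups = first_hop_associative(parent, p[-1])
    hop_g = first_hop_geometric(emb, p[-1], first_hops)
    print(f"leaf {p[-1]}: associative -> {hop_a} ({lookups} lookups), geometric -> {hop_g}")
```

The sketch only shows why a suitable geometry makes the task a single step; it says nothing about whether or why training produces such a geometry, which is the question the paper raises.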

Load-bearing premise

The observed geometry stems from a spectral bias that arises naturally rather than from typical supervisory, architectural, or optimization pressures.

What would settle it

Training a sequence model on data where local co-occurrence statistics are preserved but global graph structure is removed, then checking whether the single-step navigation behavior disappears.

Figures

Figures reproduced from arXiv: 2510.26745 by Elan Rosenfeld, Sanjiv Kumar, Shahriar Noroozizadeh, Vaishnavh Nagarajan.

Figure 1
Figure 1. Figure 1: Associative vs. geometric memory of models trained on various graphs. There are two dramatically different ways to memorize a dataset of atomic facts. The common view is of associative memory: entities are embedded arbitrarily, and co-occurrences are stored in weight matrices. (left). §2.4: In practice, we find a geometric memory: the learned embeddings of a Transformer (middle) reflect global structure in… view at source ↗
Figure 2
Figure 2. Figure 2: Overview of in-context path-star task of B&N’24. Each training and test example corresponds to a fresh, randomly-labeled path-star graph (a tree graph where only the root node branches into d paths of length ℓ). For each example, the prefix specifies a randomized adjacency list (of edge bigrams) of the corresponding graph, followed by (vroot, vgoal). The target is the full path (vroot → vgoal) in that grap… view at source ↗
Figure 3
Figure 3. Figure 3: Overview of our in-weights path-star task. All examples are derived from a fixed path-star graph. Training involves two types of examples: (i) edge memorization examples; (ii) path-finding examples, where the prefix is some leaf, and the target is the full path. Test examples are path examples corresponding to a held-out set of leaves. view at source ↗
Figure 4
Figure 4. Figure 4: Success of Transformer in in-weights path-star task. (§2) (left) A next-token-trained Transformer achieves perfect or highly non-trivial accuracy on large path-star graphs Gd,ℓ. (middle) Learning order of tokens. The tokens of a path are not learned in the reverse order i.e., the model does not learn the right-to-left solution. Thus, gradients from the future tokens are not critical for success. (right) Su… view at source ↗
Figure 5
Figure 5. Figure 5: Failure of Transformer in in-context path-star task. (B&N’24) We report the failure of next-token Transformers in the in-context version of the path-star task, reproducing results from B&N’24. (left) Full path accuracy remains at chance level across different small graph sizes. (middle) Learning order of tokens with teacherless (multi-token-trained) objective shows a clear right-to-left learning cascade. (… view at source ↗
Figure 6
Figure 6. Figure 6: Evidence of global geometry of Transformer in path-star task. (a) In the heatmap, entry (i, j) is the cosine distance between the leaf embedding of path i (row) and the first-hop embedding of path j (col). The clear diagonal line implies that embeddings within each path are more aligned, reflecting global structure. (b) UMAP projection of token embeddings where each point is a node embedding; color indica… view at source ↗
Figure 7
Figure 7. Figure 7: Associative memory can be discovered quickly by gradient descent, given a sufficiently wide model, and a sufficiently large learning rate: For the various tiny graphs described in §C.3, and for our TinyNN model (with frozen embedding/unembedding layers and one wide trainable weight matrix to prevent a geometry from taking over; see §B.2.2), we report memorization over timesteps of training with full-batch … view at source ↗
Figure 8
Figure 8. Figure 8: Geometric memorization takes much longer for gradient descent to discover: For the same architecture as in [PITH_FULL_IMAGE:figures/full_fig_p018_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Spectral geometry arises in Node2Vec without low-rank pressure (Observation 3) for tiny Path-Star, Grid, Cycle, and Irregular Graphs (top to bottom). (a) The Fiedler-like vectors of the graphs encode global structure; this structure mirrors the Node2Vec embeddings shown in [PITH_FULL_IMAGE:figures/full_fig_p021_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Embeddings of weight-untied Node2Vec do not show a clean geometry in the top directions. Corresponding multi-hop cosine similarities are in [PITH_FULL_IMAGE:figures/full_fig_p022_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Embeddings of a weight-untied Transformer do not show a clean geometry in the top directions. Corresponding multi-hop cosine similarities are in [PITH_FULL_IMAGE:figures/full_fig_p022_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Success of Transformer in in-weights path-star task. (left) A next-token-trained Transformer achieves perfect or highly non-trivial accuracy on large path-star graphs Gd,ℓ (Observation 4a). (middle) Learning order of tokens. The tokens of a path are not learned in the reverse order i.e., the model does not learn the right-to-left solution. Thus, gradients from the future tokens are not critical for succe… view at source ↗
Figure 13
Figure 13. Figure 13: Failure of Transformer in in-context path-star task. (B&N’24) We report the failure of next-token Transformers in the in-context version of the path-star task, reproducing results from B&N’24. (left) Full path accuracy remains at chance level across different small graph sizes. (middle) Learning order of tokens with teacherless (multi-token-trained) objective shows a clear right-to-left learning cascade. … view at source ↗
Figure 14
Figure 14. Figure 14: (left) Success of in-weights path-star task for Mamba. This figure is a counterpart to [PITH_FULL_IMAGE:figures/full_fig_p049_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Evidence of global geometry in path-star task for Mamba. This figure is a counterpart to Figs. 6 and 25 for the Mamba SSM architecture. Recall that entry (i, j) is the mean cosine distance between the leaf token in (an unseen) path i (row) and first/hardest token on (an unseen) path j (col). Each heatmap corresponds to a different training objective: Left: trained on edges and path-finding task (Dedge ∪ D… view at source ↗
Figure 16
Figure 16. Figure 16: UMAP projection of token embeddings of Mamba exhibits path-star topology. We corroborate the UMAP [109] observations from the Transformer ( [PITH_FULL_IMAGE:figures/full_fig_p050_16.png] view at source ↗
Figure 17
Figure 17. Figure 17: Transformer achieves non-trivial accuracy on the harder in-weights tree-star task. The tree-star task of §C.2 introduces decision-making at every step of the path, not just the first token. There are two variants of this task based on the test-train split. In the split on first token variant (top-left), we reserve some of the trees for generating training paths, and the rest for test paths. In the split o… view at source ↗
Figure 18
Figure 18. Figure 18: Tiny path-star: Geometries of various architectures on a smaller version of the path-star graph. See Observation 5. [Panel titles: Associative; Node2Vec; Eigenvectors; Transformer; Neural Network; Mamba SSM; sub-label: (Transformer with Frozen Embeddings)] [PITH_FULL_IMAGE:figures/full_fig_p054_18.png] view at source ↗
Figure 19
Figure 19. Figure 19: Tiny grid: Geometries of various architectures on a small 4 × 4 grid graph. See Observation 5. [PITH_FULL_IMAGE:figures/full_fig_p054_19.png] view at source ↗
Figure 20
Figure 20. Figure 20: Tiny cycle: Geometries of various architectures on a small cycle graph. See Observation 5. [Panel titles: Associative; Node2Vec; Eigenvectors; Transformer; Neural Network; Mamba SSM; sub-label: (Transformer with Frozen Embeddings)] [PITH_FULL_IMAGE:figures/full_fig_p055_20.png] view at source ↗
Figure 21
Figure 21. Figure 21: Tiny irregular graph: Geometries of various architectures on a small irregular graph of two connected components, both asymmetric. See Observation 5. Note that unlike in the other graphs, we do not use a 1-layer model here, but a 3-layered one. [PITH_FULL_IMAGE:figures/full_fig_p055_21.png] view at source ↗
Figure 22
Figure 22. Figure 22: For various learning rates, geometric memorization takes much longer for gradient descent to discover: This is an extended version of [PITH_FULL_IMAGE:figures/full_fig_p056_22.png] view at source ↗
Figure 23
Figure 23. Figure 23: Node-node cosine similarity vs. adjacency matrix for the Transformer: For each graph, we plot the cosine similarity between all node embeddings and compare it against the adjacency matrix. Observe that the cosine similarities exhibit a richer structure than the adjacency matrix, reflecting some notion of multi-hop distance e.g., in the cycle graph, there is a gradual decrease in similarity as we walk towa… view at source ↗
Figure 24
Figure 24. Figure 24: Node-node cosine similarity vs. adjacency matrix for the Node2Vec model: Like in the Transformer ( [PITH_FULL_IMAGE:figures/full_fig_p057_24.png] view at source ↗
Figure 25
Figure 25. Figure 25: Evidence of global geometry in path-star task: Leaf-first-token cosine distance between node embeddings. We present again the heatmaps from [PITH_FULL_IMAGE:figures/full_fig_p058_25.png] view at source ↗
Figure 26
Figure 26. Figure 26: Evidence of global geometry in path-star task: Pathwise average cosine distance between node embeddings. While in [PITH_FULL_IMAGE:figures/full_fig_p059_26.png] view at source ↗
Figure 27
Figure 27. Figure 27: Zoomed in version of Fig [PITH_FULL_IMAGE:figures/full_fig_p059_27.png] view at source ↗
Figure 28
Figure 28. Figure 28: Mixed edge supervision enables forward path generation while forward-only fails due to reversal curse. Exact-match accuracy on held-out leaves for multiple path-star graphs (varying degree d and path length ℓ). As established, training on mixed edges Dedge yields high non-trivial forward accuracy across graphs. But training on forward-only D→ edge fails on both the forward and reverse tasks. This is indic… view at source ↗
Figure 29
Figure 29. Figure 29: Forward vs. reverse path generation: The figure contrasts the model’s performance on forward (start→leaf) and reverse (leaf→start) path generation tasks for path-star graphs learned either in-weights (left) or in-context (right). While both methods achieve perfect accuracy on the algorithmically simple reverse path task, their performance on the forward task differs dramatically. (left) The in-weights mod… view at source ↗
Figure 30
Figure 30. Figure 30: Embeddings of a Transformer with bi-directional vs. uni-directional edge memorization. With our smaller graphs, we find that a geometry arises regardless of whether the model is made to memorize both or only one direction of each edge. However, the geometry is weaker (e.g., for the grid graph) under uni-directional memorization. [PITH_FULL_IMAGE:figures/full_fig_p062_30.png] view at source ↗
Figure 31
Figure 31. Figure 31: Adding a short sequence of pause tokens after the prompt reliably boosts exact-match accuracy across graphs, for a given amount of training time. Increasing the number of pause tokens increases speed of convergence. [Plot: Full Path Accuracy [%] vs. Epochs [Thousands]; series: number of pause tokens 0, 2, 4, 6] [PITH_FULL_IMAGE:figures/full_fig_p063_31.png] view at source ↗
Figure 32
Figure 32. Figure 32: The fact that the composition task is learnable in this regime implies that the edge-pretrained model must have come with an adequate global geometry despite being trained only on local supervision (Refutation 3a). [Bar chart: Accuracy [%] on the in-weights path-star task for graphs G_{5×10^3,5}, G_{10^4,6}, and G_{10^4,10}; series: chance level, full path, hardest token] [PITH_FULL_IMAGE:figures/full_fig_p063_… view at source ↗
Figure 33
Figure 33. Figure 33: Node-node cosine similarity vs. adjacency matrix for weight-untied Node2Vec. As discussed in §4.4, observe that the node-to-node cosine-similarity matrix appears to be a bland near-zero matrix, lacking any information about the multi-hop connectivity of the underlying graph. [Panels: node embeddings; node-node similarity matrix; adjacency matrix; axes: node index] … view at source ↗
Figure 34
Figure 34. Figure 34: Node-node cosine similarity vs. adjacency matrix for weight-untied Transformer. As discussed in §4.4, observe that the node-to-node cosine-similarity matrix appears to be a bland near-zero matrix, lacking any information about the multi-hop connectivity of the underlying graph. [PITH_FULL_IMAGE:figures/full_fig_p064_34.png] view at source ↗
Figure 35
Figure 35. Figure 35: Learning dynamics per token. (a) In the in-context setting of G_{2,5} (trained with a multi-token, teacherless objective since standard next-token prediction fails), later tokens are learned first indicating strong reliance on future-token signals. (b) In the in-weights setting of G_{10^4,6} with next-token prediction, token accuracies rise largely in tandem (or in a somewhat confusing order); the first token is… view at source ↗
Original abstract

Deep sequence models are said to store atomic facts predominantly in the form of associative memory: a brute-force lookup of co-occurring entities. We identify a dramatically different form of storage of atomic facts that we term as geometric memory. Here, the model has synthesized embeddings encoding novel global relationships between all entities, including ones that do not co-occur in training. Such storage is powerful: for instance, we show how it transforms a hard reasoning task involving an $\ell$-fold composition into an easy-to-learn $1$-step navigation task. From this phenomenon, we extract fundamental aspects of neural embedding geometries that are hard to explain. We argue that the rise of such a geometry, as against a lookup of local associations, cannot be straightforwardly attributed to typical supervisory, architectural, or optimizational pressures. Counterintuitively, a geometry is learned even when it is more complex than the brute-force lookup. Then, by analyzing a connection to Node2Vec, we demonstrate how the geometry stems from a spectral bias that -- in contrast to prevailing theories -- indeed arises naturally despite the lack of various pressures. This analysis also points out to practitioners a visible headroom to make Transformer memory more strongly geometric. We hope the geometric view of parametric memory encourages revisiting the default intuitions that guide researchers in areas like knowledge acquisition, capacity, discovery, and unlearning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that deep sequence models memorize atomic facts geometrically rather than via associative lookup of co-occurrences. Embeddings are synthesized to encode novel global relationships between all entities (including non-co-occurring ones), transforming an ℓ-fold composition reasoning task into an easy 1-step navigation task. The authors argue this geometry cannot be straightforwardly attributed to typical supervisory, architectural or optimizational pressures, is counterintuitively more complex than brute-force lookup, and instead arises from a spectral bias demonstrated via a connection to Node2Vec; they point to headroom for making Transformer memory more strongly geometric and implications for knowledge acquisition, capacity, discovery and unlearning.

Significance. If the empirical observations and the spectral-bias explanation hold, the work would be significant for offering a distinct geometric view of parametric memory that challenges prevailing intuitions about associative storage. The Node2Vec link could bridge empirical findings with spectral graph methods, while the practical suggestion for enhancing geometric properties in Transformers would be useful for practitioners working on reasoning and knowledge representation.

major comments (2)
  1. [Node2Vec analysis] Section on Node2Vec connection: the central claim that the observed global geometry 'stems from a spectral bias that arises naturally' rests on the Node2Vec analysis. This connection is presented as explanatory, yet remains analogical; the manuscript does not isolate whether random-walk co-occurrence statistics alone produce the reported eigenstructure independently of the sequence model's layered back-propagation and next-token loss. Without such isolation or controls, the argument that the geometry cannot be attributed to standard training dynamics is not yet secured.
  2. [Empirical observations] Empirical sections describing non-co-occurring pairs and ℓ-fold to 1-step transformation: the claim that embeddings encode novel global relationships (including for entities that never co-occur) is load-bearing for the 'geometric memory' phenomenon. Full methods, data-selection controls, and ablations are needed to rule out post-hoc artifacts, as the current presentation leaves open whether the geometry is a general tendency or specific to the chosen setups.
minor comments (2)
  1. [Abstract] Abstract: the term 'geometric memory' is introduced without a concise formal characterization on first use, which would help readers grasp the distinction from associative memory immediately.
  2. [Throughout] Notation: ensure consistent use of symbols for embeddings and entity relationships across sections to avoid ambiguity when discussing global vs. local associations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight opportunities to clarify the Node2Vec analysis and strengthen the empirical controls. We address each point below and will incorporate revisions to improve the manuscript.

Point-by-point responses
  1. Referee: [Node2Vec analysis] Section on Node2Vec connection: the central claim that the observed global geometry 'stems from a spectral bias that arises naturally' rests on the Node2Vec analysis. This connection is presented as explanatory, yet remains analogical; the manuscript does not isolate whether random-walk co-occurrence statistics alone produce the reported eigenstructure independently of the sequence model's layered back-propagation and next-token loss. Without such isolation or controls, the argument that the geometry cannot be attributed to standard training dynamics is not yet secured.

    Authors: We appreciate the referee's emphasis on securing the isolation. The manuscript presents the Node2Vec link to show that the eigenstructure follows from the co-occurrence statistics generated by next-token prediction on sequences, which implicitly perform random walks on the entity graph; this is not merely analogical but follows because the training objective directly optimizes for those statistics. We agree that an explicit control would make the separation from layered back-propagation clearer. In the revision we will add a non-neural baseline that factorizes the empirical co-occurrence matrix derived from the same data and verifies that the reported spectral properties are recovered without any neural architecture or gradient-based training. revision: yes
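    A non-neural check of this kind can be sketched as follows. This is a minimal illustration under stated assumptions, not the baseline the authors describe: it uses the eigenvectors of the graph Laplacian of a small path-star graph as a stand-in for the "Fiedler-like vectors" (the response above instead proposes factorizing an empirical co-occurrence matrix), and the helper names `path_star_adjacency`, `spectral_embeddings`, and `cosine` are hypothetical. It asks whether purely spectral embeddings already align a leaf with the first-hop node of its own path.

```python
import numpy as np

def path_star_adjacency(d, ell):
    """Adjacency matrix of a path-star graph: root 0, d arms of ell nodes each."""
    n = 1 + d * ell
    A = np.zeros((n, n))
    paths = []
    for i in range(d):
        path = [1 + i * ell + k for k in range(ell)]
        prev = 0
        for v in path:
            A[prev, v] = A[v, prev] = 1.0
            prev = v
        paths.append(path)
    return A, paths

def spectral_embeddings(A, k):
    """ASSUMPTION: embed nodes with the k lowest non-constant Laplacian eigenvectors."""
    L = np.diag(A.sum(axis=1)) - A          # combinatorial graph Laplacian
    _, eigvecs = np.linalg.eigh(L)          # eigenvalues come back in ascending order
    return eigvecs[:, 1:k + 1]              # skip the constant eigenvector

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

d, ell = 4, 5
A, paths = path_star_adjacency(d, ell)
emb = spectral_embeddings(A, k=d - 1)

# sim[i, j]: cosine similarity between the leaf of path i and the first hop of path j
sim = np.array([[cosine(emb[p[-1]], emb[q[0]]) for q in paths] for p in paths])
print(np.round(sim, 2))
```

    In this toy setting the diagonal of `sim` should dominate, so the first hop can be read off from the leaf in one step with no neural network or gradient-based training involved. Whether the same holds for a factorized co-occurrence matrix built from the paper's training data is a separate question that this sketch does not answer.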

  2. Referee: [Empirical observations] Empirical sections describing non-co-occurring pairs and ℓ-fold to 1-step transformation: the claim that embeddings encode novel global relationships (including for entities that never co-occur) is load-bearing for the 'geometric memory' phenomenon. Full methods, data-selection controls, and ablations are needed to rule out post-hoc artifacts, as the current presentation leaves open whether the geometry is a general tendency or specific to the chosen setups.

    Authors: We agree that additional documentation and controls are warranted to establish generality. The manuscript already specifies the synthetic data generation process and the criterion used to identify non-co-occurring pairs, but the presentation can be made more self-contained. In the revision we will expand the methods section with explicit data-selection rules, include supplementary ablations that vary the proportion of non-co-occurring pairs and the graph structure, and report results on an alternative synthetic task to demonstrate that the ℓ-fold to 1-step transformation and the global geometry persist beyond the primary experimental setups. revision: yes

Circularity Check

0 steps flagged

No significant circularity: derivation grounded in external Node2Vec connection and empirical observations

full rationale

The paper identifies geometric memory through direct empirical analysis of sequence model embeddings that encode non-co-occurring relations and simplify composition tasks. The explanation that this stems from a spectral bias is explicitly tied to an analysis of the connection with the independent Node2Vec algorithm, whose random-walk co-occurrence mechanism is external to the present work and does not rely on the paper's own fitted values or definitions. No step reduces the claimed geometry to a self-definition, a renamed prediction of the same data, or a load-bearing self-citation chain. The contrast with typical supervisory pressures is presented as an argument from the Node2Vec parallel rather than a tautological claim. The derivation therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that the geometry is not explained by standard training pressures and that the Node2Vec connection captures the mechanism. No explicit free parameters are stated in the abstract; the one invented entity listed below is the term "geometric memory" itself.

axioms (1)
  • [domain assumption] The rise of geometric embeddings cannot be attributed to typical supervisory, architectural, or optimizational pressures.
    Abstract states this as a key argument against straightforward explanations.
invented entities (1)
  • geometric memory [no independent evidence]
    purpose: A form of storage where embeddings encode global relationships between entities that do not co-occur.
    New term introduced to describe the observed phenomenon distinct from associative memory.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    the geometry stems from a spectral bias that—in contrast to prevailing theories—indeed arises naturally despite the lack of various pressures... the converged solution... columns of embedding matrix V span the graph’s Fiedler-like vectors

  • IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel echoes

    global information readily available i.e., f(u)[v] is proportional to multi-hop distance... low-rank factorization of adjacency matrix

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Geometric Factual Recall in Transformers

    cs.CL 2026-05 conditional novelty 8.0

    A single-layer transformer memorizes random subject-attribute bijections using logarithmic embedding dimension via linear superpositions in embeddings and ReLU-gated selection in the MLP, with zero-shot transfer to ne...

Reference graph

Works this paper leans on

218 extracted references · 218 canonical work pages · cited by 1 Pith paper · 17 internal anchors

  1. [1]

    On the non-universality of deep learning: quantifying the cost of symmetry

    Emmanuel Abbe and Enric Boix-Adsera. On the non-universality of deep learning: quantifying the cost of symmetry. In Advances in Neural Information Processing Systems, volume 35, pages 17188–17201. Curran Associates, Inc., 2022

  2. [2]

    Poly-time universality and limitations of deep learning

    Emmanuel Abbe and Colin Sandon. Poly-time universality and limitations of deep learning. arXiv preprint arXiv:2001.02992, 2020

  3. [3]

    On the universality of deep learning

    Emmanuel Abbe and Colin Sandon. On the universality of deep learning. Advances in Neural Information Processing Systems, 33:20061–20072, 2020

  4. [4]

    On the power of differentiable learning versus pac and sq learning

    Emmanuel Abbe, Pritish Kamath, Eran Malach, Colin Sandon, and Nathan Srebro. On the power of differentiable learning versus pac and sq learning. Advances in Neural Information Processing Systems, 34:24340–24351, 2021

  5. [5]

    Provable advantage of curriculum learning on parity targets with mixed inputs

    Emmanuel Abbe, Elisabetta Cornacchia, and Aryo Lotfi. Provable advantage of curriculum learning on parity targets with mixed inputs. Advances in Neural Information Processing Systems, 36:24291–24321, 2023

  6. [6]

    Learning high-degree parities: The crucial role of the initialization

    Emmanuel Abbe, Elisabetta Cornacchia, Jan Hązła, and Donald Kougang-Yombi. Learning high-degree parities: The crucial role of the initialization. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=OuNIWgGGif

  7. [7]

    Analogies explained: Towards understanding word embeddings

    Carl Allen and Timothy M. Hospedales. Analogies explained: Towards understanding word embeddings. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pages 223–231. PMLR, 20...

  8. [8]

    What the vec? towards probabilistically grounded embeddings

    Carl Allen, Ivana Balazevic, and Timothy M. Hospedales. What the vec? towards probabilistically grounded embeddings. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d’Alché-Buc, Emily B. Fox, and Roman Garnett, editors, Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 201...

  9. [10]

    Physics of language models: Part 3.3, knowledge capacity scaling laws

    Zeyuan Allen-Zhu and Yuanzhi Li. Physics of language models: Part 3.3, knowledge capacity scaling laws. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025

  10. [11]

    Implicit regularization in deep matrix factorization

    Sanjeev Arora, Nadav Cohen, Wei Hu, and Yuping Luo. Implicit regularization in deep matrix factorization. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 7411–7422, 2019

  11. [12]

    A closer look at memorization in deep networks

    Devansh Arpit, Stanisław Jastrzębski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S. Kanwal, Tegan Maharaj, Asja Fischer, Aaron Courville, Yoshua Bengio, and Simon Lacoste-Julien. A closer look at memorization in deep networks. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, vo...

  12. [13]

    The pitfalls of next-token prediction

    Gregor Bachmann and Vaishnavh Nagarajan. The pitfalls of next-token prediction. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pages 2296–2318, 2024

  13. [14]

    Neural networks and principal component analysis: Learning from examples without local minima

    Pierre Baldi and Kurt Hornik. Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 2(1):53–58, 1989. doi: 10.1016/0893-6080(89)90014-2. URL https://doi.org/10.1016/0893-6080(89)90014-2

  14. [15]

    Deep learning through the lens of example difficulty

    Robert J. N. Baldock, Hartmut Maennel, and Behnam Neyshabur. Deep learning through the lens of example difficulty. In Marc’Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan, editors, Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIP...

  15. [16]

    Lessons from studying two-hop latent reasoning

    Mikita Balesni, Tomek Korbak, and Owain Evans. Lessons from studying two-hop latent reasoning, 2025. URL https://arxiv.org/abs/2411.16353

  16. [17]

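    The sample complexity of pattern classification with neural networks: The size of the weights is more important than the size of the network

    Peter L. Bartlett. The sample complexity of pattern classification with neural networks: The size of the weights is more important than the size of the network. IEEE Trans. Inf. Theory, 44(2):525–536, 1998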

  17. [18]

    The reversal curse: LLMs trained on "a is b" fail to learn "b is a"

    Lukas Berglund, Meg Tong, Maximilian Kaufmann, Mikita Balesni, Asa Cooper Stickland, Tomasz Korbak, and Owain Evans. The reversal curse: LLMs trained on "a is b" fail to learn "b is a". In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=...

  18. [19]

    On the inductive bias of neural tangent kernels

    Alberto Bietti and Julien Mairal. On the inductive bias of neural tangent kernels. Advances in Neural Information Processing Systems, 32, 2019

  19. [20]

    Birth of a transformer: A memory viewpoint

    Alberto Bietti, Vivien Cabannes, Diane Bouchacourt, Hervé Jégou, and Léon Bottou. Birth of a transformer: A memory viewpoint. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023

  20. [21]

    Hopping too late: Exploring the limitations of large language models on multi-hop queries

    Eden Biran, Daniela Gottesman, Sohee Yang, Mor Geva, and Amir Globerson. Hopping too late: Exploring the limitations of large language models on multi-hop queries. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024, Miami, FL, USA, November 12-16,...

  21. [22]

    The Creative Mind - Myths and Mechanisms (2. ed.)

    Margaret A. Boden. The Creative Mind - Myths and Mechanisms (2. ed.). Routledge, 2003

  22. [23]

    Reflections after refereeing papers for NIPS

    Leo Breiman. Reflections after refereeing papers for NIPS. In The Mathematics of Generalization, pages 11–15. CRC Press, 2018

  23. [24]

    A mechanistic analysis of a transformer trained on a symbolic multi-step reasoning task

    Jannik Brinkmann, Abhay Sheshadri, Victor Levoso, Paul Swoboda, and Christian Bartelt. A mechanistic analysis of a transformer trained on a symbolic multi-step reasoning task. In Findings of the Association for Computational Linguistics, ACL 2024, Bangkok, Thailand and virtual meeting, August 11-16, 2024, pages 4082–4102. Association for Computational Lin...

  24. [25]

    Memory transformer

    Mikhail S Burtsev, Yuri Kuratov, Anton Peganov, and Grigory V Sapunov. Memory transformer. arXiv preprint arXiv:2006.11527, 2020

  25. [26]

    Scaling laws for associative memories

    Vivien Cabannes, Elvis Dohmatob, and Alberto Bietti. Scaling laws for associative memories. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=Tzh6xAJSll

  26. [27]

    Learning associative memories with gradient descent

    Vivien Cabannes, Berfin Simsek, and Alberto Bietti. Learning associative memories with gradient descent. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=A9fLbXLRTK

  27. [29]

    Stephanie C. Y. Chan, Adam Santoro, Andrew K. Lampinen, Jane X. Wang, Aaditya K. Singh, Pierre H. Richemond, James L. McClelland, and Felix Hill. Data distributional properties drive emergent in-context learning in transformers. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIP...

  28. [30]

    The geometry of multilingual language model representations

    Tyler A. Chang, Zhuowen Tu, and Benjamin K. Bergen. The geometry of multilingual language model representations. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, editors, Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022, pages 119–136. Association...

  29. [31]

    Probing BERT in hyperbolic spaces

    Boli Chen, Yao Fu, Guangwei Xu, Pengjun Xie, Chuanqi Tan, Mosha Chen, and Liping Jing. Probing BERT in hyperbolic spaces. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021

  30. [32]

    Theoretical limitations of multi-layer transformer

    Lijie Chen, Binghui Peng, and Hongxun Wu. Theoretical limitations of multi-layer transformer. arXiv preprint arXiv:2412.02975, 2024

  31. [33]

    Understanding the interplay between parametric and contextual knowledge for large language models

    Sitao Cheng, Liangming Pan, Xunjian Yin, Xinyi Wang, and William Yang Wang. Understanding the interplay between parametric and contextual knowledge for large language models, 2024. URL https://arxiv.org/abs/2410.08414

  33. [35]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

  34. [37]

    A mathematical model for curriculum learning for parities

    Elisabetta Cornacchia and Elchanan Mossel. A mathematical model for curriculum learning for parities. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 6402–6423. PMLR, 23–29 Jul 2023. URL https://proceedings.mlr.press/v202/cornacchia23a.html

  35. [38]

    Revisiting the graph reasoning ability of large language models: Case studies in translation, connectivity and shortest path

    Xinnan Dai, Qihao Wen, Yifei Shen, Hongzhi Wen, Dongsheng Li, Jiliang Tang, and Caihua Shan. Revisiting the graph reasoning ability of large language models: Case studies in translation, connectivity and shortest path, 2025. URL https://arxiv.org/abs/2408.09529

  36. [39]

    Community detection guarantees using embeddings learned by node2vec

    Andrew Davison, S. Carlyle Morgan, and Owen G. Ward. Community detection guarantees using embeddings learned by node2vec. In Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024, 2024

  37. [40]

    Faith and fate: Limits of transformers on compositionality

    Nouha Dziri, Ximing Lu, Melanie Sclar, Xiang Lorraine Li, Liwei Jiang, Bill Yuchen Lin, Sean Welleck, Peter West, Chandra Bhagavatula, Ronan Le Bras, et al. Faith and fate: Limits of transformers on compositionality. Advances in Neural Information Processing Systems, 36, 2024

  38. [41]

    Toy Models of Superposition

    Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, and Christopher Olah. Toy models of superposition, 2022. URL https://arxiv.org/abs/2209.10652

  39. [42]

    Towards understanding linear word analogies

    Kawin Ethayarajh, David Duvenaud, and Graeme Hirst. Towards understanding linear word analogies. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 1: Long Papers, pages 3253–3262. Association for Computational Linguistics, 2019

  40. [43]

    Does learning require memorization? a short tale about a long tail

    Vitaly Feldman. Does learning require memorization? a short tale about a long tail. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, STOC 2020, Chicago, IL, USA, June 22-26, 2020, pages 954–959. ACM, 2020

  41. [44]

    Extractive structures learned in pretraining enable generalization on finetuned facts

    Jiahai Feng, Stuart Russell, and Jacob Steinhardt. Extractive structures learned in pretraining enable generalization on finetuned facts. arXiv preprint arXiv:2412.04614, 2024

  42. [45]

    Emergence and function of abstract representations in self-supervised transformers

    Quentin RV. Ferry, Joshua Ching, and Takashi Kawai. Emergence and function of abstract representations in self-supervised transformers, 2023. URL https://arxiv.org/abs/2312.05361

  43. [46]

    On the creativity of large language models

    Giorgio Franceschelli and Mirco Musolesi. On the creativity of large language models. CoRR, abs/2304.00008, 2023

  44. [47]

    The mystery of the pathological path-star task for language models

    Arvid Frydenlund. The mystery of the pathological path-star task for language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024, Miami, FL, USA, November 12-16, 2024, pages 12493–12516. Association for Computational Linguistics, 2024

  45. [48]

    Language models, graph searching, and supervision adulteration: When more supervision is less and how to make more more

    Arvid Frydenlund. Language models, graph searching, and supervision adulteration: When more supervision is less and how to make more more. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vienna, Austria, ...

  46. [49]

    Relational reasoning and inductive bias in transformers and large language models

    Jesse Geerts, Stephanie Chan, Claudia Clopath, and Kimberly Stachenfeld. Relational reasoning and inductive bias in transformers trained on a transitive inference task, 2025. URL https://arxiv.org/abs/2506.04289

  47. [50]

    Transformer feed-forward layers are key-value memories

    Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. Transformer feed-forward layers are key-value memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, pages 5484–5495. Association for Computational Linguistics, 2021

  48. [51]

    Dissecting recall of factual associations in auto-regressive language models

    Mor Geva, Jasmijn Bastings, Katja Filippova, and Amir Globerson. Dissecting recall of factual associations in auto-regressive language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, pages 12216–12235. Association for Computational Linguistics, 2023

  49. [52]

    Understanding finetuning for factual knowledge extraction

    Gaurav Rohit Ghosal, Tatsunori Hashimoto, and Aditi Raghunathan. Understanding finetuning for factual knowledge extraction. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=cPsn9AcOYh

  50. [53]

    Learning dense representations for entity retrieval

    Daniel Gillick, Sayali Kulkarni, Larry Lansing, Alessandro Presta, Jason Baldridge, Eugene Ie, and Diego Garcia-Olano. Learning dense representations for entity retrieval. In Mohit Bansal and Aline Villavicencio, editors, Proceedings of the 23rd Conference on Computational Natural Language Learning, CoNLL 2019, Hong Kong, China, November 3-4, 2019, pages 5...

  51. [54]

    Alex Gittens, Dimitris Achlioptas, and Michael W. Mahoney. Skip-gram - zipf + uniform = vector additivity. In Regina Barzilay and Min-Yen Kan, editors, Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, pages 69–76. Association for Computational Li...

  52. [55]

    Limits of end-to-end learning

    Tobias Glasmachers. Limits of end-to-end learning. In Proceedings of The 9th Asian Conference on Machine Learning, ACML 2017, volume 77 of Proceedings of Machine Learning Research, pages 17–32. PMLR, 2017

  53. [57]

    Graph embedding techniques, applications, and performance: A survey

    Palash Goyal and Emilio Ferrara. Graph embedding techniques, applications, and performance: A survey. Knowl. Based Syst., 151:78–94, 2018. doi: 10.1016/J.KNOSYS.2018.03.022. URL https://doi.org/10.1016/j.knosys.2018.03.022

  54. [58]

    Think before you speak: Training language models with pause tokens

    Sachin Goyal, Ziwei Ji, Ankit Singh Rawat, Aditya Krishna Menon, Sanjiv Kumar, and Vaishnavh Nagarajan. Think before you speak: Training language models with pause tokens. The Twelfth International Conference on Learning Representations, ICLR 2024, 2024

  55. [59]

    word2vec, node2vec, graph2vec, x2vec: Towards a theory of vector embeddings of structured data

    Martin Grohe. word2vec, node2vec, graph2vec, x2vec: Towards a theory of vector embeddings of structured data. In Dan Suciu, Yufei Tao, and Zhewei Wei, editors, Proceedings of the 39th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2020, Portland, OR, USA, June 14-19, 2020, pages 1–16. ACM, 2020

  56. [60]

    Grokking modular arithmetic

    Andrey Gromov. Grokking modular arithmetic, 2023. URL https://arxiv.org/abs/2301.02679

  57. [61]

    Mamba: Linear-time sequence modeling with selective state spaces

    Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces, 2023

  58. [62]

    Knowledge matters: Importance of prior information for optimization

    Çaglar Gülçehre and Yoshua Bengio. Knowledge matters: Importance of prior information for optimization. J. Mach. Learn. Res., 17:8:1–8:32, 2016

  59. [63]

    Implicit regularization in matrix factorization

    Suriya Gunasekar, Blake E. Woodworth, Srinadh Bhojanapalli, Behnam Neyshabur, and Nati Srebro. Implicit regularization in matrix factorization. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 6151–6159, 2017

  60. [64]

    Gpt4graph: Can large language models understand graph structured data? an empirical evaluation and benchmarking

    Jiayan Guo, Lun Du, Hengyu Liu, Mengyu Zhou, Xinyi He, and Shi Han. Gpt4graph: Can large language models understand graph structured data? an empirical evaluation and benchmarking, 2023. URL https://arxiv.org/abs/2305.15066

  61. [65]

    Mitigating reversal curse in large language models via semantic-aware permutation training

    Qingyan Guo, Rui Wang, Junliang Guo, Xu Tan, Jiang Bian, and Yujiu Yang. Mitigating reversal curse in large language models via semantic-aware permutation training. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Findings of the Association for Computational Linguistics, ACL 2024, Bangkok, Thailand and virtual meeting, August 11-16, 2024, page...

  62. [66]

    Language models represent space and time

    Wes Gurnee and Max Tegmark. Language models represent space and time. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=jE8xbmvFin

  64. [68]

    Provable guarantees for self-supervised deep learning with spectral contrastive loss

    Jeff Z. HaoChen, Colin Wei, Adrien Gaidon, and Tengyu Ma. Provable guarantees for self-supervised deep learning with spectral contrastive loss. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 5000–5011, 2021

  65. [69]

    Convergence guarantees for the deepwalk embedding on block models

    Christopher Harker and Aditya Bhaskara. Convergence guarantees for the deepwalk embedding on block models. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=xwxUbBHC1q

  66. [70]

    Word embeddings as metric recovery in semantic spaces

    Tatsunori B. Hashimoto, David Alvarez-Melis, and Tommi S. Jaakkola. Word embeddings as metric recovery in semantic spaces. Trans. Assoc. Comput. Linguistics, 4:273–286, 2016. doi: 10.1162/TACL_A_00098. URL https://doi.org/10.1162/tacl_a_00098

  67. [71]

    Energy transformer

    Benjamin Hoover, Yuchen Liang, Bao Pham, Rameswar Panda, Hendrik Strobelt, Duen Horng Chau, Mohammed Zaki, and Dmitry Krotov. Energy transformer. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 27532–27559. Curran Associates, Inc., 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/57a9b97477b67936298489e3c1417b0a-Paper-Conference.pdf

  69. [73]

    Neural networks and physical systems with emergent collective computational abilities

    J J Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79(8):2554–2558, 1982. doi: 10.1073/pnas.79.8.2554. URL https://www.pnas.org/doi/abs/10.1073/pnas.79.8.2554

  70. [74]

    Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes

    Cheng-Yu Hsieh, Chun-Liang Li, Chih-Kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alexander Ratner, Ranjay Krishna, Chen-Yu Lee, and Tomas Pfister. Distilling step-by-step! outperforming larger language models with less training data and smaller model sizes. arXiv preprint arXiv:2305.02301, 2023

  71. [75]

    The belief state transformer

    Edward S. Hu, Kwangjun Ahn, Qinghua Liu, Haoran Xu, Manan Tomar, Ada Langford, Dinesh Jayaraman, Alex Lamb, and John Langford. The belief state transformer. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025

  72. [76]

    Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry P. Heck. Learning deep structured semantic models for web search using clickthrough data. In Qi He, Arun Iyengar, Wolfgang Nejdl, Jian Pei, and Rajeev Rastogi, editors, 22nd ACM International Conference on Information and Knowledge Management, CIKM’13, San Francisco, CA, USA, October 2...

  73. [77]

    Generalization or hallucination? understanding out-of-context reasoning in transformers

    Yixiao Huang, Hanlin Zhu, Tianyu Guo, Jiantao Jiao, Somayeh Sojoudi, Michael I Jordan, Stuart Russell, and Song Mei. Generalization or hallucination? understanding out-of-context reasoning in transformers. In Advances in Neural Information Processing Systems 39: Annual Conference on Neural Information Processing Systems 2025, NeurIPS 2025, 2025

  74. [78]

    Position: The platonic representation hypothesis

    Minyoung Huh, Brian Cheung, Tongzhou Wang, and Phillip Isola. Position: The platonic representation hypothesis. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=BH8TYy0r6u

  75. [79]

    The spectral underpinning of word2vec

    Ariel Jaffe, Yuval Kluger, Ofir Lindenbaum, Jonathan Patsenker, Erez Peterfreund, and Stefan Steinerberger. The spectral underpinning of word2vec, 2020

  76. [80]

    Erik Jenner, Shreyas Kapur, Vasil Georgiev, Cameron Allen, Scott Emmons, and Stuart J. Russell. Evidence of learned look-ahead in a chess-playing neural network. In Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024, 2024

  77. [81]

    Do LLMs dream of elephants (when told not to)? latent concept association and associative memory in transformers

    Yibo Jiang, Goutham Rajendran, Pradeep Ravikumar, and Bryon Aragam. Do LLMs dream of elephants (when told not to)? latent concept association and associative memory in transformers. In Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 1...

  78. [82]

    On the origins of linear representations in large language models

    Yibo Jiang, Goutham Rajendran, Pradeep Kumar Ravikumar, Bryon Aragam, and Victor Veitch. On the origins of linear representations in large language models. In Forty-first International Conference on Machine Learning, ICML 2024, 2024

  79. [83]

    Tokio Kajitsuka and Issei Sato. Are transformers with one layer self-attention using low-rank weight matrices universal approximators? In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=nJnky5K944

Showing first 80 references.