arxiv: 2605.22972 · v1 · pith:3PZZG5Q4new · submitted 2026-05-21 · 💻 cs.LG · cs.AI

A mathematical theory of balancing relational generalization and memorization

Pith reviewed 2026-05-25 05:53 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords transitive inference · relational generalization · exception memorization · kernel ridge regression · representational geometry · language models

The pith

Kernel ridge regression balances transitive inference and exception memorization only under specific representational geometries.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the transitive inference with exceptions task to study how systems learn a general relational rule while also memorizing a single violation of that rule. It provides an analytical characterization of kernel ridge regression solutions over a family of input representations and task parameters. The analysis shows these models can achieve the desired balance, yet success requires particular geometric properties of the representations that are not needed in the exception-free case. The same pattern of generalization and errors appears when the theory is tested by finetuning pretrained language models on ordered relations.
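The setup described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes a hypothetical embedding (one-hot codes for each item plus a conjunctive pair unit scaled by `s`), a linear kernel, a small ridge `lam`, and an arbitrarily chosen exception pair. The paper's representation family and task parameters may differ.

```python
from itertools import product


def embed(a, b, n, s):
    """Hypothetical pair embedding: one-hot(a) ++ one-hot(b) ++ s * one-hot((a, b))."""
    x = [0.0] * (2 * n + n * n)
    x[a] = 1.0
    x[n + b] = 1.0
    x[2 * n + a * n + b] = s
    return x


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def solve(A, y):
    """Solve A z = y by Gaussian elimination with partial pivoting."""
    m = len(y)
    M = [row[:] + [yi] for row, yi in zip(A, y)]
    for c in range(m):
        p = max(range(c, m), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, m):
            f = M[r][c] / M[c][c]
            for k in range(c, m + 1):
                M[r][k] -= f * M[c][k]
    z = [0.0] * m
    for r in reversed(range(m)):
        z[r] = (M[r][m] - sum(M[r][k] * z[k] for k in range(r + 1, m))) / M[r][r]
    return z


def krr(train, n, s, lam=1e-6):
    """Fit linear-kernel ridge regression; return a predictor f(a, b)."""
    pairs = list(train)
    X = [embed(a, b, n, s) for a, b in pairs]
    # Gram matrix with the ridge term added on the diagonal.
    K = [[dot(xi, xj) + (lam if i == j else 0.0) for j, xj in enumerate(X)]
         for i, xi in enumerate(X)]
    alpha = solve(K, [train[p] for p in pairs])
    return lambda a, b: sum(w * dot(embed(a, b, n, s), x) for w, x in zip(alpha, X))


def make_task(n, exception=None):
    """Adjacent pairs labelled +1 if a < b, else -1; optionally one flipped pair."""
    train = {}
    for i in range(n - 1):
        train[(i, i + 1)] = 1.0
        train[(i + 1, i)] = -1.0
    if exception is not None:
        a, b = exception  # assumed to be a non-adjacent pair
        train[(a, b)] = -1.0 if a < b else 1.0  # label contradicts the rank order
        train[(b, a)] = -train[(a, b)]
    test = [(a, b) for a, b in product(range(n), repeat=2)
            if a != b and (a, b) not in train]
    return train, test


def transitive_accuracy(f, test):
    """Fraction of held-out pairs whose predicted sign follows the rank order."""
    return sum((f(a, b) > 0) == (a < b) for a, b in test) / len(test)
```

Building a task with `make_task(7, exception=(1, 5))`, fitting `krr` for several values of `s`, and comparing `transitive_accuracy` on the held-out pairs gives a quick way to probe how this toy geometry trades the rule off against the exception. With `s = 0` and no exception the embedding is purely additive, and for small `lam` the fitted function is approximately f(a, b) = b - a.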

Core claim

Kernel ridge regression models can solve transitive inference while correctly handling one exception provided the representational geometry separates the general rule from the exception in kernel space; the same models fail for other geometries even when the task parameters are held fixed.

What carries the argument

Analytical solution of kernel ridge regression on embeddings of ordered tuples that include one explicit exception to the transitive rule.
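For orientation, the standard closed form of kernel ridge regression is given here. The notation is generic textbook notation and an assumption on our part, not necessarily the paper's: $\phi$ is the embedding of an item pair, $x_1, \dots, x_m$ are the training pairs with labels $y \in \mathbb{R}^m$, and $\lambda > 0$ is the ridge parameter (some conventions scale it by $m$).

```latex
\begin{aligned}
K_{ij} &= k(x_i, x_j) = \phi(x_i)^\top \phi(x_j), \\
\mathbf{k}(x) &= \bigl(k(x, x_1), \dots, k(x, x_m)\bigr)^\top, \\
\hat f(x) &= \mathbf{k}(x)^\top \,(K + \lambda I)^{-1}\, y .
\end{aligned}
```

Because $\hat f$ depends on the inputs only through kernel values, changing the representational geometry changes $K$ and $\mathbf{k}(x)$ and therefore the predictions on held-out pairs, even when the training labels are held fixed.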

If this is right

  • Generalization on the task requires geometries in which the exception does not produce destructive interference with the transitive kernel structure.
  • Pretrained language models finetuned on the task will exhibit both rule-consistent generalization and the systematic mistakes that follow from the geometry analysis.
  • Even a single exception makes the problem mechanistically harder than standard transitive inference, because the geometry must now support both the rule and the exception.
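A small worked example illustrates why an exception constrains geometry. It is our own illustration under an assumed, purely additive representation, not a result quoted from the paper. Suppose the model has the form $f(a, b) = u_a + v_b$, is trained on adjacent pairs with $f(i, i+1) = +1$ and $f(i+1, i) = -1$, and must also fit an exception pair $(p, q)$ with $p < q$ and flipped labels.

```latex
\begin{aligned}
&\text{Let } w_i = u_i - v_i. \text{ The adjacent pairs give} \\
&\quad f(i, i+1) - f(i+1, i) = w_i - w_{i+1} = 2 . \\
&\text{Telescoping from } p \text{ to } q > p: \\
&\quad f(p, q) - f(q, p) = w_p - w_q = 2\,(q - p) > 0 . \\
&\text{The exception requires } f(p, q) = -1,\ f(q, p) = +1, \text{ i.e.} \\
&\quad f(p, q) - f(q, p) = -2 < 0 ,
\end{aligned}
```

which is a contradiction. Under this assumption a purely additive code cannot fit both the rule and the exception, so some non-additive (for example, conjunctive) component is needed, and how strongly it is weighted affects what happens on held-out pairs.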

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same geometry sensitivity is likely to appear in other relational tasks that combine a dominant rule with isolated counterexamples.
  • Controlling embedding geometry during pretraining or fine-tuning could be used to improve exception handling without sacrificing rule generalization.
  • Direct tests on transformer architectures rather than kernel proxies would clarify whether the predicted geometry dependence survives in modern networks.

Load-bearing premise

Kernel ridge regression behavior is representative of how neural networks learn on this relational task.

What would settle it

If language models finetuned on ordered relations with one exception neither generalize according to the transitive rule nor produce the specific error pattern predicted by the kernel analysis, the claimed link between geometry and performance would not hold.

Figures

Figures reproduced from arXiv: 2605.22972 by Luke Cheng, Samuel Lippl.

Figure 1: Complex behavior in the real world requires a mixture of general rule learning and …
Figure 2: Transitive inference with exceptions. Illustration showing the relevant training and test sets as well as the associated relation. White squares reflect item pairs without an expected generalization. 1. Memorization of transitive pairs: Does the model learn the training pairs where at least one item is in O(i)? 2. Memorization of intransitive pairs: Does the model learn the training pairs where both items…
Figure 3: Illustration of the theorem. A, The ranking system consists of the original rank from TI, r_TI, plus a perturbation r_pert (example here uses α = 0.2). B, Example ranking systems (for n = 9, p = 6, q = 4, c̃ → ∞). C, Any exchangeable representation is equivalent, via an orthonormal change of basis, to a four-hot representation where two units represent each input item x_j and x_k separately, one unit represe…
Figure 4: Behavior of the kernel model across task space.
Figure 5: Finetuning pretrained language models (PLMs) on relational data with exceptions.
Original abstract

Humans, animals, and modern machine learning models exhibit impressive abilities to learn complex behaviors and generalize these behaviors to unseen situations. This ability requires us to learn rules and regularities that allow for such generalizations. At the same time, in most complex environments, any rule will have its exceptions. How do learning systems balance between learning general regularities and memorizing exceptions? We argue that a lack of task paradigms has hindered the study of this essential ability. To address this gap, we introduce a novel task, transitive inference with exceptions, that tests for relational generalization and memorization of an exception to the relational rule. We then analytically characterize the behavior of a simple, theoretically tractable model of neural network learning (kernel ridge regression) across a broad family of representations and task parameters. We find that these models can balance between relational generalization and memorization, but unlike for transitive inference without an exception, successful generalization is sensitive to the specific representational geometry. We explain why this task is more challenging mechanistically by drawing on our analytical theory. Finally, we validate our theoretical insights in pretrained language models that are finetuned on ordered relations, finding that these models successfully generalize according to the transitive rule, but also make the kinds of systematic mistakes predicted by our theory. Overall, our theory shows how learning systems can balance between relational generalization and memorization, explains how this can go wrong, and emphasizes the need for new task paradigms designed to probe this ability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces the transitive inference with exceptions task to probe how learning systems balance relational generalization against memorization of exceptions. It provides an analytical characterization of kernel ridge regression (KRR) across a family of representations and task parameters, concluding that these models can achieve the balance but that successful generalization becomes sensitive to representational geometry precisely when an exception is present (unlike the exception-free case). The authors explain the mechanistic source of this sensitivity, then show that the predicted error patterns appear when pretrained language models are finetuned on ordered relations.

Significance. If the KRR analysis is internally sound and the geometry-sensitivity predictions are borne out beyond the convex fixed-feature setting, the work supplies a concrete mathematical account of the generalization-memorization trade-off on relational tasks and a falsifiable link to observed LM behavior. The explicit analytical treatment of KRR and the direct comparison to LM error patterns are strengths that would be valuable to the community.

major comments (2)
  1. [Abstract / analytical characterization] Abstract (paragraph on analytical characterization) and the modeling section: the central claim that 'successful generalization is sensitive to the specific representational geometry' is derived under kernel ridge regression with fixed features. Because KRR solves a convex problem in a predetermined feature space, it cannot capture representation learning under gradient descent; the manuscript does not demonstrate that the same geometry dependence survives once features are allowed to adapt, which is required to underwrite the mechanistic explanation offered for language-model behavior.
  2. [LM validation experiments] Validation section (LM experiments): the reported systematic mistakes in finetuned LMs are said to match the theory's predictions, yet the manuscript does not report controls that isolate representational geometry (e.g., by varying embedding dimensionality or kernel bandwidth while holding other factors fixed). Without such controls it remains unclear whether the observed errors are produced by the same mechanism identified in the KRR analysis.
minor comments (1)
  1. [Task definition] Notation for the exception parameter and the geometry family should be introduced with a single consolidated table or figure early in the task-definition section to reduce cross-referencing.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive and detailed report. We address each major comment below, indicating planned revisions where appropriate.

Point-by-point responses
  1. Referee: [Abstract / analytical characterization] Abstract (paragraph on analytical characterization) and the modeling section: the central claim that 'successful generalization is sensitive to the specific representational geometry' is derived under kernel ridge regression with fixed features. Because KRR solves a convex problem in a predetermined feature space, it cannot capture representation learning under gradient descent; the manuscript does not demonstrate that the same geometry dependence survives once features are allowed to adapt, which is required to underwrite the mechanistic explanation offered for language-model behavior.

    Authors: We agree that the analytical results are derived exclusively for kernel ridge regression with fixed features, which permits the closed-form characterization across representations and task parameters. The manuscript presents KRR as a tractable model of neural network learning but does not provide analysis or experiments showing that the reported geometry sensitivity persists when features adapt under gradient descent. We will revise the abstract and modeling section to explicitly state this scope limitation and add discussion of the implications for the mechanistic account of language-model behavior.

    Revision: partial

  2. Referee: [LM validation experiments] Validation section (LM experiments): the reported systematic mistakes in finetuned LMs are said to match the theory's predictions, yet the manuscript does not report controls that isolate representational geometry (e.g., by varying embedding dimensionality or kernel bandwidth while holding other factors fixed). Without such controls it remains unclear whether the observed errors are produced by the same mechanism identified in the KRR analysis.

    Authors: We acknowledge that the LM experiments do not include explicit controls that isolate representational geometry while holding other factors fixed. The reported error patterns are consistent with the KRR predictions, but alternative mechanisms cannot be ruled out without additional controls. We will revise the validation section to discuss this limitation and, where feasible, incorporate supplementary analyses (e.g., varying model embedding dimensions or using controlled synthetic representations) to better link the observations to the KRR mechanism.

    Revision: partial

standing simulated objections not resolved
  • Demonstrating that the geometry dependence survives under adaptive feature learning via gradient descent

Circularity Check

0 steps flagged

No circularity: analytical KRR characterization is self-contained

Full rationale

The paper performs an analytical characterization of kernel ridge regression behavior on the transitive inference with exceptions task across representational geometries. This is a direct mathematical derivation from the KRR objective and kernel definitions rather than any fitting of parameters to target outputs followed by relabeling as prediction. No self-citations, uniqueness theorems, or ansatzes imported from prior author work are referenced as load-bearing. The subsequent empirical validation on language models is presented as an independent check, not part of the core derivation chain. Absent any quoted equations that reduce the claimed results to their inputs by construction, the derivation chain does not exhibit circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, axioms, or invented entities; full text required to populate the ledger.

pith-pipeline@v0.9.0 · 5783 in / 1087 out tokens · 21049 ms · 2026-05-25T05:53:38.472174+00:00 · methodology

