Attention to task structure for cognitive flexibility
Pith reviewed 2026-05-10 13:16 UTC · model grok-4.3
The pith
Task connectivity in the environment strongly modulates both stability and generalization, with pronounced benefits for attention-based models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In a multi-task environment defined by combinations of two cue dimensions and characterized by graph-theory connectivity, richer task sets improve both generalization and stability, while higher connectivity between tasks further boosts these measures, with especially large gains for gating-based and concatenation-based attention models relative to multilayer perceptrons.
What carries the argument
Graph-theory connectivity between tasks in the cue-dimension environment, which interacts with multiplicative gating and concatenation attention mechanisms to enable task decomposition and transfer.
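The gating and concatenation mechanisms are only named on this page, not specified. A minimal PyTorch sketch of the two ways a task cue could be injected into a feedforward body; the class names, layer sizes, and the sigmoid gate are illustrative assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn

class TaskGateNet(nn.Module):
    """Hypothetical gating-based variant: a task embedding
    multiplicatively gates the stimulus hidden layer."""
    def __init__(self, stim_dim, task_dim, hidden_dim, out_dim):
        super().__init__()
        self.stim = nn.Linear(stim_dim, hidden_dim)
        self.gate = nn.Linear(task_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, out_dim)

    def forward(self, x, task):
        h = torch.relu(self.stim(x))
        g = torch.sigmoid(self.gate(task))  # per-unit gates in (0, 1)
        return self.out(h * g)              # multiplicative task attention

class TaskConcatNet(nn.Module):
    """Hypothetical concatenation-based variant: the task embedding
    is appended to the stimulus before the shared layers."""
    def __init__(self, stim_dim, task_dim, hidden_dim, out_dim):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(stim_dim + task_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, x, task):
        return self.body(torch.cat([x, task], dim=-1))
```

Under this reading, gating constrains which hidden units a task can recruit, while concatenation leaves the network free to learn any mixing of stimulus and task features; either form lets a task cue steer processing without changing the shared architecture.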
If this is right
- Richer task environments produce simultaneous gains in retaining prior knowledge and transferring it to novel tasks.
- Higher graph connectivity between tasks improves both stability and generalization measures for all models.
- Attention models that sequentially allocate focus to task components show larger performance lifts from connectivity than multilayer perceptrons.
- Environmental structure and model architecture interact to determine overall cognitive flexibility.
Where Pith is reading between the lines
- Training curricula for artificial agents could be designed by deliberately increasing relevant task connections to improve flexibility without altering the network architecture.
- The same connectivity principle may apply to human skill acquisition, where overlapping task structures could accelerate both retention and transfer.
- Graph analysis of task relations could become a practical tool for predicting which model families will perform best in a given domain; a minimal sketch of such an analysis follows this list.
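As a concrete rendering of that last point, a sketch of a task graph over two cue dimensions, summarized with standard connectivity metrics. The cue values, the shared-cue edge rule, and the metric choices are illustrative assumptions, not the paper's construction:

```python
import itertools
import networkx as nx

# Hypothetical cue dimensions; each task is one (color, shape) combination.
colors = ["red", "green", "blue"]
shapes = ["circle", "square", "triangle"]
tasks = list(itertools.product(colors, shapes))

# Assumed edge rule: two tasks are connected when they share a cue value.
G = nx.Graph()
G.add_nodes_from(tasks)
for a, b in itertools.combinations(tasks, 2):
    if a[0] == b[0] or a[1] == b[1]:
        G.add_edge(a, b)

# Connectivity summaries one could correlate with the stability and
# generalization scores measured per environment.
mean_degree = sum(d for _, d in G.degree()) / G.number_of_nodes()
print("mean degree:", mean_degree)
print("average clustering:", nx.average_clustering(G))
print("average path length:", nx.average_shortest_path_length(G))
```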
Load-bearing premise
The artificial environment of cue-dimension combinations and graph connectivity captures the main factors that govern cognitive flexibility in natural settings.
What would settle it
Finding that attention models lose their connectivity advantage, or that connectivity no longer modulates stability and generalization, in a different multi-task environment would falsify the central claim.
Original abstract
Humans and artificial agents must often learn and switch between multiple tasks in dynamic environments. Success in such settings requires cognitive flexibility: the ability to retain prior knowledge (cognitive stability) while also transferring it to novel tasks (cognitive generalization). Cognitive flexibility research has largely focused on the role of model architecture to achieve these complementary goals. However, it is less well understood how the structure of the environment itself influences cognitive flexibility, and how it interacts with model architecture. To address this gap, we design a multi-task learning environment in which tasks are defined by a combination of two cue dimensions, allowing us to characterize the environment with graph-theory methods. We also introduce gating-based (multiplicative) and concatenation-based attention models that can decompose tasks into components and can sequentially allocate attention to them. We compare the attention-based models' performance in the multi-task learning environment to multilayer perceptrons. Generalization and stability are systematically evaluated across environments that vary in richness and task connectivity. We observe that richer environments improve both generalization and stability. In addition, a critical novel observation is that (graph theory based) connectivity between the tasks in the environment strongly modulates both stability and generalization, with especially pronounced benefits for attention-based models. These findings underscore the importance of considering not only cognitive architectures but also environmental structure and their interaction in shaping multi-task learning, generalization, and stability.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper designs a multi-task learning environment where tasks are defined by combinations of two cue dimensions and characterized using graph-theory methods. It introduces gating-based (multiplicative) and concatenation-based attention models that decompose tasks into components and sequentially allocate attention. These are compared to multilayer perceptrons across environments varying in richness and task connectivity. The central empirical observations are that richer environments improve both generalization and stability, and that graph-theoretic task connectivity strongly modulates both outcomes, with especially pronounced benefits for the attention-based models over MLPs.
Significance. If the simulation results hold, the work makes a useful contribution by shifting focus from architecture alone to the interaction between environmental structure (quantified via graphs) and model type in achieving cognitive flexibility. The graph-based characterization of task connectivity provides a systematic, quantifiable way to vary the environment, and the finding that attention models leverage this structure more effectively than MLPs is a clear empirical observation within the defined setup. This could inform both AI multi-task learning and cognitive modeling.
Minor comments (3)
- The abstract and introduction would benefit from an explicit statement of the precise graph metrics (e.g., degree, clustering coefficient, or path length) used to quantify task connectivity and how 'richness' is operationalized in the cue-dimension graphs.
- Ensure that the results section reports the number of independent runs, random seeds, and any statistical tests or confidence intervals supporting the claims about modulation by connectivity; this is needed to assess the reliability of the 'especially pronounced benefits' for attention models (a sketch of such reporting follows this list).
- Standardize notation for the two attention variants (gating vs. concatenation) across text, equations, and figures to avoid potential reader confusion when comparing their performance.
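On the second comment, a sketch of what seed-level reporting could look like: a percentile bootstrap for a confidence interval on the mean across runs. The scores and run count below are invented for illustration; the paper's actual numbers are not given on this page.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-seed generalization scores for one condition.
scores = np.array([0.82, 0.79, 0.85, 0.81, 0.84, 0.80, 0.83, 0.78])

# Percentile bootstrap: resample seeds with replacement, 10,000 times.
boot_means = rng.choice(scores, size=(10_000, scores.size), replace=True).mean(axis=1)
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(f"mean={scores.mean():.3f}, "
      f"95% CI=[{ci_low:.3f}, {ci_high:.3f}], n={scores.size} seeds")
```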
Simulated Author's Rebuttal
We thank the referee for their positive evaluation of the manuscript, including the recognition that the graph-theoretic characterization of task connectivity and its interaction with attention models represents a useful contribution. We appreciate the recommendation for minor revision and will address any specific editorial or presentational points in the revised version.
Circularity Check
No circularity: purely empirical simulation study
Full rationale
The paper describes the design of an artificial multi-task environment using cue dimensions and graph connectivity, introduces gating- and concatenation-based attention models, compares them to MLPs via simulations, and reports observed effects of environment richness and task connectivity on stability and generalization. No equations, derivations, parameter fits, or predictions are presented that reduce to the inputs by construction. No self-citations are invoked as load-bearing for any uniqueness theorem or ansatz. The central claims are direct empirical observations within the constructed setup, which are self-contained and externally falsifiable through replication of the simulations.
Reference graph
Works this paper leans on
- [1] De Lange M, van de Ven GM, Tuytelaars T. Continual evaluation for lifelong learning: Identifying the stability gap. In: ICLR; 2023. p. 1-21.
- [2] French RM. Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences. 1999;3(4):128-35.
- [3] Grossberg S. How does a brain build a cognitive code? Psychological Review. 1980;87(1):1-51.
- [4] De Lange M, Aljundi R, Masana M, Parisot S, Jia X, Leonardis A, et al. A Continual Learning Survey: Defying Forgetting in Classification Tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2022;44(7):3366-85. doi:10.1109/TPAMI.2021.3057446.
- [5] McCloskey M, Cohen NJ. Catastrophic Interference in Connectionist Networks: The Sequential Learning Problem. In: Psychology of Learning and Motivation, vol. 24. Academic Press; 1989. p. 109-65. doi:10.1016/S0079-7421(08)60536-8.
- [6] Kim D, Han B. On the Stability-Plasticity Dilemma of Class-Incremental Learning; 2023. arXiv:2304.01663.
- [7] McClelland JL, McNaughton BL, O'Reilly RC. Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory. Psychological Review. 1995;102(3):419.
- [8] Lippl S, Stachenfeld K. When does compositional structure yield compositional generalization? A kernel theory. In: The Thirteenth International Conference on Learning Representations; 2025. Available from: https://openreview.net/forum?id=FPBce2P1er.
- [9] Musslick S, Cohen JD. Rationalizing constraints on the capacity for cognitive control. Trends in Cognitive Sciences. 2021;25(9):757-75.
- [10] Correa CG, Ho MK, Callaway F, Daw ND, Griffiths TL. Humans decompose tasks by trading off utility and computational cost. PLOS Computational Biology. 2023;19(6):e1011087.
- [11] Kirkpatrick J, Pascanu R, Rabinowitz N, Veness J, Desjardins G, Rusu AA, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences. 2017;114(13):3521-6.
- [12] Shin H, Lee JK, Kim J, Kim J. Continual Learning with Deep Generative Replay. CoRR. 2017;abs/1705.08690. arXiv:1705.08690.
- [13] Verbeke P, Verguts T. Learning to synchronize: How biological agents can couple neural task modules for dealing with the stability-plasticity dilemma. PLoS Computational Biology. 2019.
- [14] Hummos A. Thalamus: a brain-inspired algorithm for biologically-plausible continual learning and disentangled representations. In: The Eleventh International Conference on Learning Representations; 2023.
- [15] Sommers RP, Thorat S, Anthes D, Kietzmann TC. Sparks of cognitive flexibility: self-guided context inference for flexible stimulus-response mapping by attentional routing. arXiv preprint arXiv:2502.15634. 2025.
- [16] Verbeke P, Verguts T. Using top-down modulation to optimally balance shared versus separated task representations. Neural Networks. 2022;146:256-71. doi:10.1016/j.neunet.2021.11.030.
- [17] Dorrell W, Hsu K, Hollingsworth L, Lee JH, Wu J, Finn C, et al. Range, not Independence, Drives Modularity in Biologically Inspired Representations; 2025. arXiv:2410.06232.
- [18] Johnston WJ, Fusi S. Abstract representations emerge naturally in neural networks trained to perform multiple tasks. Nature Communications. 2023;14(1):1040.
- [19] Dekker RB, Otto F, Summerfield C. Curriculum learning for human compositional generalization. Proceedings of the National Academy of Sciences. 2022;119:1-12.
- [20] Holton E, Braun L, Thompson JA, Grohn J, Summerfield C. Humans and neural networks show similar patterns of transfer and interference during continual learning. Nature Human Behaviour. 2026;10(1):111-25.
- [21] Saxe AM, McClelland JL, Ganguli S. A mathematical theory of semantic development in deep neural networks. Proceedings of the National Academy of Sciences. 2019;116(23):11537-46.
- [22] Lee S, Mannelli SS, Clopath C, Goldt S, Saxe A. Maslow's Hammer in Catastrophic Forgetting: Node Re-Use vs. Node Activation. In: International Conference on Machine Learning. PMLR; 2022. p. 12455-77.
- [23] Watts DJ, Strogatz SH. Collective dynamics of 'small-world' networks. Nature. 1998;393(6684):440-2.
- [24] Yang GR, Joglekar MR, Song HF, Newsome WT, Wang XJ. Task representations in neural networks trained to perform many cognitive tasks. Nature Neuroscience. 2019;22(2).
- [25] Wolpert DH, Macready WG. No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation. 1997;1(1):67-82.
- [26] Cohen JD, Dunbar K, McClelland JL. On the control of automatic processes: a parallel distributed processing account of the Stroop effect. Psychological Review. 1990;97(3):332.
- [27] Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is all you need. In: Advances in Neural Information Processing Systems; 2017. p. 5999-6009.
- [28] Elhage N, Nanda N, Olsson C, Henighan T, Joseph N, Mann B, et al. A mathematical framework for transformer circuits. Transformer Circuits Thread. 2021;1(1):12.
- [29] Montero ML, Ludwig CJ, Costa RP, Malhotra G, Bowers J. The role of Disentanglement in Generalisation. In: International Conference on Learning Representations; 2021. Available from: https://openreview.net/forum?id=qbH974jKUVy.
- [30] Bengio Y, Louradour J, Collobert R, Weston J. Curriculum learning. In: Proceedings of the 26th Annual International Conference on Machine Learning. ICML '09. New York, NY, USA: Association for Computing Machinery; 2009. p. 41-48.
- [31] Wu X, Dyer E, Neyshabur B. When Do Curricula Work? In: International Conference on Learning Representations; 2021.
- [32] Matiisen T, Oliver A, Cohen T, Schulman J. Teacher-Student Curriculum Learning. IEEE Transactions on Neural Networks and Learning Systems. 2017.
- [33] Franklin NT, Frank MJ. Compositional clustering in task structure learning. PLoS Computational Biology. 2018:1-25.
- [34] Tomov MS, Schulz E, Gershman SJ. Multi-task reinforcement learning in humans. Nature Human Behaviour. 2021;5(6):764-73.
- [35] Şimşek Ö, Barto A. Skill characterization based on betweenness. Advances in Neural Information Processing Systems. 2008;21.
- [36] Stachenfeld KL, Botvinick MM, Gershman SJ. The hippocampus as a predictive map. Nature Neuroscience. 2017;20(11):1643-53.
- [37] Badre D, Kayser AS, D'Esposito M. Frontal cortex and the discovery of abstract action rules. Neuron. 2010;66(2):315-26.
- [38] Faradžev I. Constructive enumeration of combinatorial objects. In: Problèmes combinatoires et théorie des graphes; 1978. p. 131-5.
- [39] Abadi M, Barham P, Chen J, Chen Z, Davis A, Dean J, et al. TensorFlow: A system for large-scale machine learning. In: 12th USENIX Symposium on Operating Systems Design and Implementation; 2016. p. 265-83.