Winner-Take-All bottlenecks enforce disentangled symbolic representations in multi-task learning

Julian Gutheil (1); Robert Legenstein (1); Simon Hitzginger (1) ((1) Graz University of Technology)

arxiv: 2605.22472 · v1 · pith:AD3WXFVEnew · submitted 2026-05-21 · 💻 cs.LG


Pith reviewed 2026-05-22 06:56 UTC · model grok-4.3

classification 💻 cs.LG

keywords winner-take-all · disentangled representations · symbolic representations · multi-task learning · latent factors · neural networks · generalization


The pith

A winner-take-all bottleneck enforces disentangled symbolic representations in multi-task neural networks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper shows that inserting a winner-take-all bottleneck into a deep neural network forces the extraction of categorical latent factors during multi-task learning. The resulting internal representation becomes highly symbolic, with individual neurons or small groups of neurons each encoding one abstract feature such as an object, color, or position. A sympathetic reader would care because the mechanism supplies a concrete route by which neural networks can produce interpretable, factorized codes that support stronger generalization. The authors supply a proof under stated conditions on data and architecture and then demonstrate the same behavior empirically on two datasets even when those conditions are only approximately met.

Core claim

Under certain well-defined conditions, a WTA bottleneck within a deep neural network can enforce the extraction of categorical latent factors of the data in a multi-task learning setup. In particular, the representation that emerges in the WTA bottleneck is highly symbolic: a single neuron or a population of neurons encodes the presence of a single abstract feature such as a specific object, color, or position. Empirical results on two datasets confirm advantages for generalization even when architectures deviate from the theorem's assumptions.

What carries the argument

The winner-take-all (WTA) bottleneck, which suppresses all but the strongest activations to isolate one categorical factor at a time from otherwise entangled inputs.
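To make the mechanism concrete, here is a minimal sketch of a hard multi-head WTA operation in plain Python. It is an illustration under assumptions, not the authors' implementation: the function name `multi_wta`, the fixed `head_size`, and the tie-breaking rule are all hypothetical choices made for this example.

```python
from typing import List, Sequence


def multi_wta(pre_activations: Sequence[float], head_size: int) -> List[int]:
    """Hard winner-take-all applied independently to consecutive heads.

    Each head of `head_size` units emits a one-hot vector marking its
    strongest unit; every other unit in that head is set to 0.
    """
    if head_size <= 0 or len(pre_activations) % head_size != 0:
        raise ValueError("length must be a positive multiple of head_size")
    output: List[int] = []
    for start in range(0, len(pre_activations), head_size):
        head = pre_activations[start:start + head_size]
        # Ties go to the lowest index (an arbitrary choice for this sketch).
        winner = max(range(head_size), key=lambda i: head[i])
        output.extend(1 if i == winner else 0 for i in range(head_size))
    return output


# Two heads of three units each.
code = multi_wta([0.2, 1.5, -0.3, 0.9, 0.1, 0.4], head_size=3)
print(code)  # [0, 1, 0, 1, 0, 0]
```

Because each head can only ever output a one-hot vector, the bottleneck as a whole can express at most one category per head, which is the constraint the argument relies on.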

If this is right

The symbolic codes improve generalization across the tasks the network is trained on.
Individual neurons become dedicated encoders for single abstract features.
The same benefits appear in networks whose details fall short of the exact conditions required by the theorem.
The resulting representation acts as a bridge between subsymbolic neural computation and symbolic reasoning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Mechanisms resembling WTA, such as softmax in attention layers, may be contributing to feature isolation in large transformers.
Inserting similar bottlenecks into other training regimes could produce more interpretable models without requiring full redesign.
Relaxing the current conditions in follow-up theory would clarify how broadly the effect applies to real-world data.

Load-bearing premise

The data distribution and network architecture must satisfy conditions that let the WTA operation cleanly separate categorical factors without residual mixing from other variables.

What would settle it

If a network equipped with a WTA bottleneck is trained on multi-task data known to contain independent categorical factors and the bottleneck layer still shows mixed or distributed encodings instead of single-factor neurons, the enforcement claim would be refuted.
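One way such a test could be run is sketched below. It assumes access to pairs of (winning unit, ground-truth category) for a single WTA head and a single latent factor; the function names and the pass criterion are hypothetical and are not taken from the paper. Several units may share a category (a population code), but a unit that fires for more than one category counts as a mixed encoding.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def categories_per_unit(samples: List[Tuple[int, int]]) -> Dict[int, Set[int]]:
    """Map each winning unit to the set of latent categories it fired for."""
    seen: Dict[int, Set[int]] = defaultdict(set)
    for unit, category in samples:
        seen[unit].add(category)
    return dict(seen)


def is_single_category_code(samples: List[Tuple[int, int]]) -> bool:
    """True if every active unit fired for exactly one category."""
    return all(len(cats) == 1 for cats in categories_per_unit(samples).values())


# Units 0 and 3 both code category 2 (population code): still symbolic.
clean = [(0, 2), (1, 0), (2, 1), (3, 2), (0, 2)]
# Unit 0 fires for categories 1 and 2: a mixed encoding.
mixed = [(0, 2), (0, 1), (1, 0)]

print(is_single_category_code(clean))  # True
print(is_single_category_code(mixed))  # False
```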

Figures

Figures reproduced from arXiv: 2605.22472 by Julian Gutheil (1), Robert Legenstein (1) ((1) Graz University of Technology), Simon Hitzginger (1).

**Figure 1.** Theoretical framework. a) Network setup, showing representations (blue) and mappings (olive). Latent vector z (bottom) exemplified for two latent variables with three categories each. Each latent is coded as a one-hot vector. The latents are mapped through an injective function Φ to an entangled representation x. The WTA encoder f_enc maps x to a representation ẑ, which is constrained by a multi-WTA head. A…

**Figure 2.** Emergence of symbolic representations and generalization behavior. a) Outputs of WTA heads and their relation to latent variable values for one example model after multi-task learning. The x-axis is organized by outputs of WTA heads (10 per WTA), the y-axis by categories of latent variables. Note that the number of categories differed for different latents. A dot indicates that the corresponding WTA output…

**Figure 3.** Symbolic representation and generalization behavior on visual input. a) Symbolic representation for two example inputs. Top: Input image x. Bottom: For each latent factor z^(j), the input category that generated the sample and the corresponding WTA output ẑ^(i) after applying the mapping from WTA outputs to latent categories described in Appendix A.3. In each row, the active input category is highlighted…

Original abstract

Winner-take-all (WTA) networks constitute a central circuit motif in cortical networks of the brain. In addition, WTA-like activations are abundant in modern deep learning models in the form of the softmax activation for example in attention layers of transformers. While their role in the extraction of latent factors has been studied for relatively simple generative models, their role in the context of highly non-linearly entangled latent factors has remained elusive. In this article, we show that a WTA bottleneck within a deep neural network can enforce under certain well-defined conditions the extraction of categorical latent factors of the data in a multi-task learning setup. In particular, we prove that the representation that emerges in the WTA bottleneck is highly symbolic, where a single neuron or a population of neurons encodes the presence of a single abstract feature such as a specific object, color, or position. We furthermore show empirically on two datasets, that this also holds for architectures and setups that do not fully comply with the assumptions of our theorem and demonstrate the advantages of the acquired symbolic representation for generalization. Our proposed model provides insights into the generalization capabilities of deep neural networks with WTA-like components and may serve as an interface between symbolic and subsymbolic AI systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that a winner-take-all (WTA) bottleneck within a deep neural network provably extracts categorical latent factors under well-defined conditions on the data distribution and multi-task architecture, yielding highly symbolic representations in which single neurons or populations encode individual abstract features such as objects, colors, or positions. It supports the claim with a mathematical proof and empirical results on two datasets that demonstrate generalization advantages even when the theorem's assumptions are not fully met.

Significance. If the link between the theorem and the empirical regimes can be made rigorous, the result would supply a principled account of how WTA-like components promote disentangled representations and improved generalization in multi-task learning, offering a potential interface between subsymbolic deep networks and symbolic AI.

major comments (2)

[§3] §3 (Theorem statement): The proof requires strictly independent categorical latents, one-to-one task-factor alignment, and exact (non-leaky) WTA activation. The manuscript explicitly states that the two empirical datasets and architectures do not fully satisfy these conditions, yet still attributes the observed symbolic representations and generalization gains to the mechanism isolated by the theorem. Without a quantitative continuity argument or sensitivity analysis showing that moderate violations preserve the single-neuron encoding property, the extrapolation from proof to practice rests on an unverified assumption.
[§5] §5 (Empirical evaluation): The claim that the WTA bottleneck produces 'highly symbolic' representations is supported primarily by task performance and qualitative inspection. No explicit metric (e.g., mutual information between bottleneck units and ground-truth factors, or ablation isolating the exact-WTA effect from standard multi-task regularization) is reported to confirm that the generalization advantage arises from the symbolic mechanism rather than generic regularization.
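The mutual-information metric the referee asks for could be estimated along the following lines. This is a sketch only: it uses a plain plug-in (count-based) estimator, which is biased for small samples, and the function name and toy data are assumptions made for illustration rather than anything reported in the paper.

```python
import math
from collections import Counter
from typing import Sequence


def mutual_information_bits(xs: Sequence[int], ys: Sequence[int]) -> float:
    """Plug-in estimate of I(X; Y) in bits from paired discrete samples."""
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("xs and ys must be non-empty and of equal length")
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written with raw counts.
        mi += (count / n) * math.log2(count * n / (px[x] * py[y]))
    return mi


# A unit that tracks a binary factor exactly carries 1 bit about it.
unit_activity = [1, 0, 1, 0]
factor_value = [1, 0, 1, 0]
print(mutual_information_bits(unit_activity, factor_value))  # 1.0

# A unit that is independent of the factor carries 0 bits.
print(mutual_information_bits([1, 1, 0, 0], [1, 0, 1, 0]))  # 0.0
```

Computing this quantity for every (bottleneck unit, ground-truth factor) pair would give a matrix in which a symbolic code shows up as high values for one factor per unit and values near zero elsewhere.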

minor comments (2)

[§2] The relation between the exact WTA operator used in the theorem and the softmax approximation employed in the experiments should be stated with a precise mathematical comparison.
[Figures] Figure captions and axis labels in the representation visualizations could more explicitly indicate which abstract feature each neuron is claimed to encode.
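For the first minor comment, one standard way to relate softmax to an exact WTA operator is through a temperature parameter: as the temperature goes to zero, the softmax output approaches the one-hot argmax whenever the largest input is unique. The sketch below illustrates that limit numerically. The temperature parameterization is an assumption introduced here; the page does not say how the paper's experiments configure their softmax.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax of logits / temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [v / temperature for v in logits]
    shift = max(scaled)  # subtract the max to avoid overflow in exp
    exps = [math.exp(v - shift) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def hard_wta(logits: Sequence[float]) -> List[float]:
    """Exact WTA: one-hot vector at the index of the largest logit."""
    winner = max(range(len(logits)), key=lambda i: logits[i])
    return [1.0 if i == winner else 0.0 for i in range(len(logits))]


logits = [2.0, 1.0, 0.5]
for temperature in (1.0, 0.1, 0.01):
    probs = softmax(logits, temperature)
    gap = max(abs(p - h) for p, h in zip(probs, hard_wta(logits)))
    print(f"T={temperature}: max deviation from hard WTA = {gap:.3g}")
```

The printed deviation shrinks as the temperature decreases, which is the kind of quantitative comparison the referee's comment points toward.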

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments. We address each major comment point by point below, indicating the revisions we will make to strengthen the manuscript.


Referee: [§3] §3 (Theorem statement): The proof requires strictly independent categorical latents, one-to-one task-factor alignment, and exact (non-leaky) WTA activation. The manuscript explicitly states that the two empirical datasets and architectures do not fully satisfy these conditions, yet still attributes the observed symbolic representations and generalization gains to the mechanism isolated by the theorem. Without a quantitative continuity argument or sensitivity analysis showing that moderate violations preserve the single-neuron encoding property, the extrapolation from proof to practice rests on an unverified assumption.

Authors: We agree that the theorem relies on strict assumptions (independent categorical latents, one-to-one alignment, and exact WTA) and that the empirical datasets do not fully satisfy them, as already noted in the manuscript. To address the lack of a quantitative continuity argument, we will add a sensitivity analysis in the revised version. This will include both a theoretical discussion of approximate satisfaction of the assumptions and controlled empirical perturbations to quantify how moderate violations affect the single-neuron encoding property and generalization performance.
Revision: yes

Referee: [§5] §5 (Empirical evaluation): The claim that the WTA bottleneck produces 'highly symbolic' representations is supported primarily by task performance and qualitative inspection. No explicit metric (e.g., mutual information between bottleneck units and ground-truth factors, or ablation isolating the exact-WTA effect from standard multi-task regularization) is reported to confirm that the generalization advantage arises from the symbolic mechanism rather than generic regularization.

Authors: We agree that explicit quantitative metrics and ablations would provide stronger support for attributing the results to the symbolic mechanism. In the revision, we will report mutual information between bottleneck units and ground-truth factors. We will also add an ablation comparing the full WTA model against a multi-task baseline without the WTA bottleneck (while keeping other regularization effects matched) to isolate the contribution of the symbolic representation to generalization gains.
Revision: yes

Circularity Check

0 steps flagged

No circularity: theorem is independent mathematical result; empirical extension does not reduce to fitted inputs or self-citation

full rationale

The paper derives its central claim via a stated mathematical proof that a WTA bottleneck yields symbolic representations under explicitly listed conditions on data distribution and architecture. The abstract and description note that empirical datasets do not fully satisfy those conditions yet still exhibit the property, presented as an additional observation rather than a prediction forced by the theorem. No equations or steps reduce a claimed prediction to a fitted parameter by construction, no self-citations are invoked as load-bearing uniqueness theorems, and no ansatz or renaming of known results is used to justify the derivation. The proof and experiments remain self-contained against external benchmarks with no reduction of outputs to inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim depends on mathematical conditions for the theorem and empirical relaxation of those conditions; no explicit free parameters, new axioms, or invented entities are introduced in the abstract.


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat recovery and embed_injective · unclear

Relation between the paper passage and the cited Recognition theorem.

We prove that, when the network was trained to perfectly solve a large enough number of linear classification tasks on the latent factors z, then the representation ẑ of the WTA bottleneck is a permutation of z that is further constrained by the structure of the WTA bottleneck (referred to as structured permutation).

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking (D=3 forcing) · unclear

Relation between the paper passage and the cited Recognition theorem.

Theorem 1. ... the columns of Ĉ are a structured permutation of the columns of C.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


[1] [1]

A canonical microcircuit for neocortex.Neural computation, 1(4):480–488, 1989

Rodney J Douglas, Kevan AC Martin, and David Whitteridge. A canonical microcircuit for neocortex.Neural computation, 1(4):480–488, 1989

work page 1989

[2] [2]

Microcircuits of excitatory and inhibitory neurons in layer 2/3 of mouse barrel cortex.Journal of neurophysiology, 107(11):3116–3134, 2012

Michael Avermann, Christian Tomm, Celine Mateo, Wulfram Gerstner, and Carl CH Petersen. Microcircuits of excitatory and inhibitory neurons in layer 2/3 of mouse barrel cortex.Journal of neurophysiology, 107(11):3116–3134, 2012

work page 2012

[3] [3]

Stdp enables spiking neurons to detect hidden causes of their inputs.Advances in neural information processing systems, 22, 2009

Bernhard Nessler, Michael Pfeiffer, and Wolfgang Maass. Stdp enables spiking neurons to detect hidden causes of their inputs.Advances in neural information processing systems, 22, 2009

work page 2009

[4] [4]

A probabilistic model for learning in cortical microcircuit motifs with data-based divisive inhibition

Robert Legenstein, Zeno Jonke, Stefan Habenschuss, and Wolfgang Maass. A probabilistic model for learning in cortical microcircuit motifs with data-based divisive inhibition.arXiv preprint arXiv:1707.05182, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[5] [5]

Feedback inhi- bition shapes emergent computational properties of cortical microcircuit motifs.Journal of Neuroscience, 37(35):8511–8523, 2017

Zeno Jonke, Robert Legenstein, Stefan Habenschuss, and Wolfgang Maass. Feedback inhi- bition shapes emergent computational properties of cortical microcircuit motifs.Journal of Neuroscience, 37(35):8511–8523, 2017

work page 2017

[6] [6]

Feature discovery by competitive learning.Cognitive science, 9(1):75–112, 1985

David E Rumelhart and David Zipser. Feature discovery by competitive learning.Cognitive science, 9(1):75–112, 1985

work page 1985

[7] [7]

Attention is all you need.Advances in neural information processing systems, 30, 2017

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems, 30, 2017

work page 2017

[8] [8]

Towards a Definition of Disentangled Representations

Irina Higgins, David Amos, David Pfau, Sebastien Racaniere, Loic Matthey, Danilo Rezende, and Alexander Lerchner. Towards a definition of disentangled representations.arXiv preprint arXiv:1812.02230, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[9] [9]

Abstract representations emerge naturally in neural networks trained to perform multiple tasks.Nature Communications, 14(1):1040, 2023

W Jeffrey Johnston and Stefano Fusi. Abstract representations emerge naturally in neural networks trained to perform multiple tasks.Nature Communications, 14(1):1040, 2023. 10

work page 2023

[10] [10]

Representation learning: A review and new perspectives.IEEE transactions on pattern analysis and machine intelligence, 35(8):1798– 1828, 2013

Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives.IEEE transactions on pattern analysis and machine intelligence, 35(8):1798– 1828, 2013

work page 2013

[11] [11]

Challenging common assumptions in the unsupervised learning of disentangled representations

Francesco Locatello, Stefan Bauer, Mario Lucic, Gunnar Raetsch, Sylvain Gelly, Bernhard Schölkopf, and Olivier Bachem. Challenging common assumptions in the unsupervised learning of disentangled representations. Ininternational conference on machine learning, pages 4114–

work page

[12] [12]

Feature discovery by competitive learning

David E Rumelhart and David Zipser. Feature discovery by competitive learning. InParallel distributed processing: explorations in the microstructure of cognition, vol. 1: foundations, pages 151–193. MIT Press, 1986

work page 1986

[13] [13]

Distributed bayesian computation and self-organized learning in sheets of spiking neurons with local lateral inhibition.PloS one, 10(8):e0134356, 2015

Johannes Bill, Lars Buesing, Stefan Habenschuss, Bernhard Nessler, Wolfgang Maass, and Robert Legenstein. Distributed bayesian computation and self-organized learning in sheets of spiking neurons with local lateral inhibition.PloS one, 10(8):e0134356, 2015

work page 2015

[14] [14]

Hierarchical models of object recognition in cortex.Nature neuroscience, 2(11):1019–1025, 1999

Maximilian Riesenhuber and Tomaso Poggio. Hierarchical models of object recognition in cortex.Nature neuroscience, 2(11):1019–1025, 1999

work page 1999

[15] Robert Legenstein, Wolfgang Maass, Christos H Papadimitriou, and Santosh S Vempala. Long term memory and the densest k-subgraph problem. In 9th Innovations in Theoretical Computer Science Conference (ITCS 2018), pages 57–1. Schloss Dagstuhl–Leibniz-Zentrum für Informatik, 2018

[16] Wolfgang Maass. On the computational power of winner-take-all. Neural Computation, 12(11):2519–2535, 2000

[17] Xin Wang, Hong Chen, Si’ao Tang, Zihao Wu, and Wenwu Zhu. Disentangled representation learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(12):9677–9696, 2024

[18] Pantelis Vafidis, Aman Bhargava, and Antonio Rangel. Disentangling representations through multi-task learning. In The Thirteenth International Conference on Learning Representations, 2025

[19] Srdjan Ostojic and Stefano Fusi. Computational role of structure in neural activity and connectivity. Trends in Cognitive Sciences, 28(7):677–690, 2024

[20] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-softmax. In International Conference on Learning Representations, 2017

[21] Tom Fawcett. An introduction to ROC analysis. Pattern Recognition Letters, 27(8):861–874,

[22] ROC Analysis in Pattern Recognition

[23] Loic Matthey, Irina Higgins, Demis Hassabis, and Alexander Lerchner. dSprites: Disentanglement testing sprites dataset. https://github.com/deepmind/dsprites-dataset/, 2017

[24] Christopher P Burgess, Irina Higgins, Arka Pal, Loic Matthey, Nick Watters, Guillaume Desjardins, and Alexander Lerchner. Understanding disentangling in $\beta$-VAE. arXiv preprint arXiv:1804.03599, 2018

[25] Dor Bank, Noam Koenigstein, and Raja Giryes. Autoencoders. Machine learning for data science handbook: data mining and knowledge discovery handbook, pages 353–374, 2023

[26] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013

[27] Yukie Nagai. Predictive learning: its key role in early cognitive development. Philosophical Transactions of the Royal Society B: Biological Sciences, 374(1771), 2019

[28] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018

[29] Alireza Makhzani and Brendan Frey. K-sparse autoencoders. arXiv preprint arXiv:1312.5663, 2013

[30] Alireza Makhzani and Brendan J Frey. Winner-take-all autoencoders. Advances in Neural Information Processing Systems, 28, 2015

[31] Bruno A Olshausen and David J Field. Sparse coding of sensory inputs. Current Opinion in Neurobiology, 14(4):481–487, 2004

[32] R Quian Quiroga, Leila Reddy, Gabriel Kreiman, Christof Koch, and Itzhak Fried. Invariant visual representation by single neurons in the human brain. Nature, 435(7045):1102–1107, 2005

[33] Mattia Rigotti, Omri Barak, Melissa R Warden, Xiao-Jing Wang, Nathaniel D Daw, Earl K Miller, and Stefano Fusi. The importance of mixed selectivity in complex cognitive tasks. Nature, 497(7451):585–590, 2013

[34] Jerome Feldman. The neural binding problem(s). Cognitive Neurodynamics, 7(1):1–11, 2013

[35] Michael G Müller, Christos H Papadimitriou, Wolfgang Maass, and Robert Legenstein. A model for structured information representation in neural networks of the brain. eNeuro, 7(3), 2020

[36] PyTorch Contributors. PyTorch documentation linear layer. https://docs.pytorch.org/docs/2.11/generated/torch.nn.Linear.html, 2026. Accessed: 2026-05-05

[37] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019

[38] Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2019

[39] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: an imperative style, high-performan... Curran Associates Inc., Red Hook, NY, USA, 2019

[40] William Falcon and The PyTorch Lightning team. PyTorch Lightning. https://doi.org/10.5281/zenodo.15053754, March 2025

[41] Omry Yadan. Hydra - a framework for elegantly configuring complex applications. https://github.com/facebookresearch/hydra, 2019

A Technical appendices and supplementary material

A.1 Definitions and derivations for Section 3

Definition of symbolic representations: Consider a representation $\hat{z} = (\hat{z}_1, \dots, \hat{z}_{l'})^\top \in \{0, 1\}^{l'}$ and a total latent vector z = (...


# solved perfectly

jointly represented two categories of the second latent factor $z^{(2)}$. If one of these two neurons was active, the value of $z^{(2)}$ could not be decoded unambiguously. This ambiguity could be resolved with the help of WTA head $\hat{z}^{(2)}$, where selective conditional activations can be seen for the two ambiguous categories of $z^{(2)}$.

Table A3: Train and test performance...
