Training capsules as a routing-weighted product of expert neurons

Michael Hauser

arxiv: 1907.11639 · v1 · pith:SSY5FWSRnew · submitted 2019-07-26 · 💻 cs.NE · cs.LG · stat.ML

Pith reviewed 2026-05-24 15:04 UTC · model grok-4.3

classification 💻 cs.NE · cs.LG · stat.ML

keywords capsule networks · dynamic routing · product of experts · contrastive divergence · energy function · unsupervised learning · generative models

The pith

Capsule networks with dynamic routing can be formulated as a product of expert neurons and trained unsupervised via contrastive divergence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper treats capsules as collections of neurons inside a fully connected network, with sub-networks between capsules weighted by the routing coefficients from routing-by-agreement. An energy function is constructed to match this weighted structure, which directly implies that the capsule network equals a product of expert neurons. Alternating dynamic routing steps with gradient updates on the contrastive divergence of the energy function produces a bottom-up unsupervised training procedure. The resulting model is shown to generate realistic images on standard vision datasets.
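To make the routing step concrete, here is a minimal sketch of routing-by-agreement as described in Sabour, Frosst, and Hinton's "Dynamic routing between capsules". It is not taken from the paper under review, which may use a different variant. The names `route`, `squash`, and `u_hat` are illustrative assumptions, and the toy prediction vectors are made up.

```python
import math

def squash(s):
    """Squash nonlinearity: keeps the direction of s, maps its length into [0, 1)."""
    norm_sq = sum(x * x for x in s)
    if norm_sq == 0.0:
        return [0.0 for _ in s]
    scale = norm_sq / (1.0 + norm_sq) / math.sqrt(norm_sq)
    return [scale * x for x in s]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(b - m) for b in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route(u_hat, num_iters=3):
    """Routing-by-agreement between one lower and one upper capsule layer.

    u_hat[i][j] is the prediction vector of lower capsule i for upper capsule j.
    Returns (c, v): the routing coefficients c[i][j] used in the final iteration
    and the upper-capsule outputs v[j] they produced.
    """
    n_in, n_out, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    b = [[0.0] * n_out for _ in range(n_in)]  # routing logits, start uniform
    for _ in range(num_iters):
        # Each lower capsule distributes its output over the upper capsules.
        c = [softmax(row) for row in b]
        v = []
        for j in range(n_out):
            s_j = [sum(c[i][j] * u_hat[i][j][d] for i in range(n_in))
                   for d in range(dim)]
            v.append(squash(s_j))
        # Agreement (dot product) between prediction and output raises the logit.
        for i in range(n_in):
            for j in range(n_out):
                b[i][j] += sum(u_hat[i][j][d] * v[j][d] for d in range(dim))
    return c, v

# Toy input: 3 lower capsules, 2 upper capsules, 2-dimensional predictions.
u_hat = [
    [[1.0, 0.0], [0.1, 0.0]],
    [[0.9, 0.1], [0.0, 0.1]],
    [[0.0, 0.1], [0.0, 1.0]],
]
c, v = route(u_hat)
```

Each row of `c` sums to one, which is what lets the coefficients be read as weights on the sub-networks connecting a lower capsule to each upper capsule.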

Core claim

An energy function is designed to reflect the routing-weighted capsule model, and it follows that capsule networks with dynamic routing can be formulated as a product of expert neurons. By alternating between dynamic routing, which both finds subnetworks within the overall network and mixes the model distribution, and updating the parameters by the gradient of the contrastive divergence, a bottom-up, unsupervised learning algorithm is constructed for capsule networks with dynamic routing.

What carries the argument

The energy function designed to reflect routing-by-agreement weighted sub-networks inside a fully connected network, which converts the capsule model into a product of experts.
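For orientation, the generic product-of-experts form and its contrastive-divergence update, as given in Hinton's "Training products of experts by minimizing contrastive divergence", can be written as below. The last line is only an illustrative assumption about how routing coefficients $c_{ij}$ might weight the experts; this page does not state the paper's actual energy function, and the symbols $f_k$, $f_{ij}$, and $E_c$ are placeholders.

```latex
% Generic product of experts with expert densities f_k and energy E
p(x \mid \theta) = \frac{1}{Z(\theta)} \prod_{k} f_k(x \mid \theta_k)
                 = \frac{1}{Z(\theta)} \, e^{-E(x;\theta)},
\qquad
E(x;\theta) = -\sum_{k} \log f_k(x \mid \theta_k)

% Contrastive-divergence update (data statistics minus one-step reconstruction statistics)
\Delta\theta \;\propto\;
  -\left\langle \frac{\partial E}{\partial \theta} \right\rangle_{\text{data}}
  +\left\langle \frac{\partial E}{\partial \theta} \right\rangle_{\text{recon}}

% ASSUMPTION (illustration only): routing coefficients as weights on expert terms
E_c(x;\theta) = -\sum_{i,j} c_{ij} \log f_{ij}(x \mid \theta_{ij})
\;\;\Longrightarrow\;\;
p(x \mid \theta) \propto \prod_{i,j} f_{ij}(x \mid \theta_{ij})^{\,c_{ij}}
```

Under that assumed form, a coefficient near zero switches an expert off and a coefficient near one leaves it fully active, which matches the description of routing as selecting sub-networks.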

If this is right

Dynamic routing both identifies sub-networks and mixes the model distribution during training.
Parameters can be updated without labeled data using only the gradient of contrastive divergence.
After purely unsupervised training on standard vision datasets, the model generates realistic-looking images.
The approach constructs a bottom-up learning algorithm for capsule networks.
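The parameter-update half of this alternation can be sketched with one-step contrastive divergence (CD-1) on a small restricted Boltzmann machine, the textbook product of experts. This is a stand-in, not the paper's capsule energy model: the function names, layer sizes, learning rate, and toy data are all assumptions, and the routing pass is only marked by a comment.

```python
import math
import random

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def hidden_probs(v, W, b_h):
    """P(h_j = 1 | v); each hidden unit plays the role of one expert."""
    return [sigmoid(b_h[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(b_h))]

def visible_probs(h, W, b_v):
    """P(v_i = 1 | h)."""
    return [sigmoid(b_v[i] + sum(h[j] * W[i][j] for j in range(len(h))))
            for i in range(len(b_v))]

def sample(probs, rng):
    """Draw independent Bernoulli samples from a list of probabilities."""
    return [1.0 if rng.random() < p else 0.0 for p in probs]

def cd1_step(v0, W, b_v, b_h, lr, rng):
    """One in-place CD-1 update; returns the squared reconstruction error."""
    ph0 = hidden_probs(v0, W, b_h)      # positive phase (data)
    h0 = sample(ph0, rng)
    pv1 = visible_probs(h0, W, b_v)     # one Gibbs step back to the visibles
    ph1 = hidden_probs(pv1, W, b_h)     # negative phase (reconstruction)
    # Probabilities are used in place of samples for the negative statistics,
    # a common variance-reduction choice.
    for i in range(len(v0)):
        for j in range(len(b_h)):
            W[i][j] += lr * (v0[i] * ph0[j] - pv1[i] * ph1[j])
    for i in range(len(v0)):
        b_v[i] += lr * (v0[i] - pv1[i])
    for j in range(len(b_h)):
        b_h[j] += lr * (ph0[j] - ph1[j])
    return sum((a - b) ** 2 for a, b in zip(v0, pv1))

# Toy training run on two binary patterns (hypothetical data).
rng = random.Random(0)
n_vis, n_hid = 6, 3
W = [[rng.gauss(0.0, 0.01) for _ in range(n_hid)] for _ in range(n_vis)]
b_v, b_h = [0.0] * n_vis, [0.0] * n_hid
data = [[1.0, 1.0, 1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]]
for epoch in range(200):
    for v0 in data:
        # In the procedure summarized above, a dynamic-routing pass would be
        # interleaved here before each parameter update.
        cd1_step(v0, W, b_v, b_h, lr=0.1, rng=rng)
```

No labels appear anywhere in the update, which is the sense in which the procedure is unsupervised.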

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The formulation may allow capsule networks to inherit training techniques developed for other energy-based product-of-experts models.
Routing coefficients could be interpreted directly as expert-selection probabilities, potentially improving interpretability of the learned sub-networks.
The same energy-function construction might extend to other routing mechanisms beyond routing-by-agreement.

Load-bearing premise

The designed energy function correctly captures the effect of routing-by-agreement weights on the sub-networks inside the fully connected network.

What would settle it

If alternating dynamic routing with contrastive-divergence updates on the parameters fails to produce a generative model that outputs realistic images from standard vision datasets, the product-of-experts formulation would not hold.

Figures

Figures reproduced from arXiv: 1907.11639 by Michael Hauser.

**Figure 1.** Routing diagram between layers with 144 capsules at layer l and 10 capsules at layer l + 1. The rectangles represent capsules, and are color coded so that lighter colors represent low-probability activations while darker colors represent high-probability activations. Similarly, the edges connecting the rectangles are the routing coefficients c_ij^(l), and for lighter edges there is a lower routing weigh…

**Figure 2.** The network architecture used for the experiments. First the autoen…

**Figure 3.** Images created by the unsupervised, routing-weighted product of expert…

Original abstract

Capsules are the multidimensional analogue to scalar neurons in neural networks, and because they are multidimensional, much more complex routing schemes can be used to pass information forward through the network than what can be used in traditional neural networks. This work treats capsules as collections of neurons in a fully connected neural network, where sub-networks connecting capsules are weighted according to the routing coefficients determined by routing by agreement. An energy function is designed to reflect this model, and it follows that capsule networks with dynamic routing can be formulated as a product of expert neurons. By alternating between dynamic routing, which acts to both find subnetworks within the overall network as well as to mix the model distribution, and updating the parameters by the gradient of the contrastive divergence, a bottom-up, unsupervised learning algorithm is constructed for capsule networks with dynamic routing. The model and its training algorithm are qualitatively tested in the generative sense, and the model is able to produce realistic looking images from standard vision datasets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that capsule networks using dynamic routing-by-agreement can be reformulated as a product-of-experts model in which routing coefficients act as weights selecting sub-networks within a larger fully-connected network. An energy function is constructed to encode this structure, so that alternating dynamic routing steps (which both discover sub-networks and mix the model distribution) with parameter updates via the gradient of contrastive divergence yields an unsupervised, bottom-up training procedure. The resulting model is evaluated qualitatively by its ability to generate realistic images on standard vision datasets.

Significance. If the energy-function equivalence is rigorously established and the iterative nature of routing-by-agreement is correctly captured as static sub-network weights, the work would supply a novel unsupervised training route for capsule networks that directly connects them to product-of-experts energy-based models. The approach also supplies an explicit mechanism for bottom-up learning without labeled data, which is a recognized gap in current capsule literature.

major comments (2)

[Abstract] Abstract (paragraph beginning 'An energy function is designed to reflect this model'): the central claim that capsule networks 'can be formulated as a product of expert neurons' rests on the assertion that the designed energy function correctly encodes routing coefficients as fixed sub-network weights. No derivation is supplied showing how the iterative, agreement-dependent recomputation of routing coefficients is reduced to a static product-of-experts factorization; without this step the contrastive-divergence gradient cannot be guaranteed to correspond to the claimed model.
[Abstract] Abstract (final sentence on qualitative testing): the generative results are described only as 'able to produce realistic looking images' with no controls, baselines, or quantitative metrics (e.g., FID, reconstruction error, or comparison against a standard capsule auto-encoder). Because the unsupervised claim is load-bearing, the absence of any falsifiable evaluation metric leaves the practical utility of the alternating routing-plus-CD procedure untested.
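As one example of the quantitative checks the referee asks for, reconstruction error can be computed as a per-pixel mean squared error. The sketch below is a minimal illustration, assuming images are flat lists of pixel intensities and that `reconstruct` is any callable mapping an image to its reconstruction; neither convention comes from the paper.

```python
def mean_squared_error(original, reconstruction):
    """Mean squared error between two equally sized flat pixel lists."""
    if len(original) != len(reconstruction):
        raise ValueError("inputs must have the same number of pixels")
    if not original:
        raise ValueError("inputs must be non-empty")
    return sum((a - b) ** 2 for a, b in zip(original, reconstruction)) / len(original)

def dataset_reconstruction_error(images, reconstruct):
    """Average per-image MSE over a dataset.

    `reconstruct` is a hypothetical callable (image -> reconstruction), for
    example an encode-then-decode pass through a trained model.
    """
    errors = [mean_squared_error(img, reconstruct(img)) for img in images]
    return sum(errors) / len(errors)
```

Reporting the same number for a baseline model on the same data would give the comparison the referee describes.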

minor comments (1)

[Abstract] The abstract does not define the precise form of the energy function or the contrastive-divergence objective, making it impossible for a reader to verify the product-of-experts reduction without the full manuscript.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation of the equivalence and the evaluation.

Referee: [Abstract] Abstract (paragraph beginning 'An energy function is designed to reflect this model'): the central claim that capsule networks 'can be formulated as a product of expert neurons' rests on the assertion that the designed energy function correctly encodes routing coefficients as fixed sub-network weights. No derivation is supplied showing how the iterative, agreement-dependent recomputation of routing coefficients is reduced to a static product-of-experts factorization; without this step the contrastive-divergence gradient cannot be guaranteed to correspond to the claimed model.

Authors: The manuscript constructs the energy function to encode the routing coefficients as selecting sub-networks (product-of-experts terms) within the larger network. We acknowledge that the abstract does not include the full step-by-step reduction from iterative routing-by-agreement to the static factorization used for the energy model. In revision we will add an explicit derivation subsection showing how each routing iteration produces fixed expert weights for the purpose of the contrastive-divergence update, thereby justifying the gradient.

revision: yes

Referee: [Abstract] Abstract (final sentence on qualitative testing): the generative results are described only as 'able to produce realistic looking images' with no controls, baselines, or quantitative metrics (e.g., FID, reconstruction error, or comparison against a standard capsule auto-encoder). Because the unsupervised claim is load-bearing, the absence of any falsifiable evaluation metric leaves the practical utility of the alternating routing-plus-CD procedure untested.

Authors: We agree that the current evaluation is purely qualitative. In the revised manuscript we will report FID scores, reconstruction errors, and direct comparisons against a standard capsule auto-encoder and other unsupervised generative baselines on the same datasets, providing quantitative support for the utility of the alternating procedure.

revision: yes

Circularity Check

1 step flagged

Energy function designed to reflect routing model makes product-of-experts formulation tautological

specific steps

self-definitional [Abstract]
"An energy function is designed to reflect this model, and it follows that capsule networks with dynamic routing can be formulated as a product of expert neurons."

The energy function is introduced with the stated goal of reflecting the routing-weighted capsule model; the product-of-experts formulation is then asserted to follow directly from that design. The claimed equivalence therefore reduces to the modeling assumption itself rather than emerging from an independent derivation.

full rationale

The paper's core claim rests on designing an energy function whose explicit purpose is to encode the routing-weighted sub-network structure, after which the product-of-experts equivalence is stated to 'follow'. This matches the self-definitional pattern: the equivalence is a direct consequence of the modeling choice rather than an independent derivation from first principles or external constraints. No other load-bearing steps (self-citations, fitted predictions, or imported uniqueness theorems) appear in the provided text. The derivation is therefore partially circular by construction of the central modeling step.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the existence of an energy function that exactly encodes the routing-weighted sub-network structure; no free parameters, additional axioms, or invented entities are stated in the abstract.

axioms (1)

domain assumption: An energy function can be designed that reflects the model of capsules as routing-weighted sub-networks inside a fully connected network.
Basis: the abstract states 'An energy function is designed to reflect this model' immediately before claiming the product-of-experts equivalence.

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages

[1] Geoffrey E Hinton, Alex Krizhevsky, and Sida D Wang. Transforming auto-encoders. In International Conference on Artificial Neural Networks, pages 44–51. Springer, 2011.

[2] Sara Sabour, Nicholas Frosst, and Geoffrey E Hinton. Dynamic routing between capsules. In Advances in Neural Information Processing Systems, pages 3856–3866, 2017.

[3] Warren S McCulloch and Walter Pitts. A logical calculus of the ideas immanent in nervous activity. The Bulletin of Mathematical Biophysics, 5(4):115–133, 1943.

[4] Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner, et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[5] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[6] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning internal representations by error propagation. Technical report, California Univ San Diego La Jolla Inst for Cognitive Science, 1985.

[7] Geoffrey E Hinton, Sara Sabour, and Nicholas Frosst. Matrix capsules with EM routing. 2018.

[8] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

[9] Ayush Jaiswal, Wael AbdAlmageed, Yue Wu, and Premkumar Natarajan. Capsulegan: Generative adversarial capsule network. In Proceedings of the European Conference on Computer Vision (ECCV), pages 0–0, 2018.

[10] Geoffrey E Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800, 2002.

[11] Max Welling, Michal Rosen-Zvi, and Geoffrey E Hinton. Exponential family harmoniums with an application to information retrieval. In Advances in Neural Information Processing Systems, pages 1481–1488, 2005.

[12] Asja Fischer and Christian Igel. An introduction to restricted Boltzmann machines. In Iberoamerican Congress on Pattern Recognition, pages 14–36. Springer, 2012.

[13] Matthew D Zeiler, Dilip Krishnan, Graham W Taylor, and Robert Fergus. Deconvolutional networks. In CVPR, volume 10, page 7, 2010.

[14] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.

[15] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283, 2016.
