Training products of expert capsules with mixing by dynamic routing

Michael Hauser

arXiv: 1907.11643 · v1 · pith:IHT7SOTOnew · submitted 2019-07-26 · 💻 cs.LG · cs.NE · stat.ML

Pith reviewed 2026-05-24 15:38 UTC · model grok-4.3

classification 💻 cs.LG · cs.NE · stat.ML

keywords capsule networks · dynamic routing · unsupervised learning · products of experts · energy function · generative models · image generation

The pith

An energy function aligned with dynamic routing enables unsupervised training of capsule networks to generate realistic images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops an unsupervised learning algorithm for products of expert capsules. It defines an energy function where the magnitude of a squashed capsule represents the probability of that capsule firing, analogous to binary neurons in restricted Boltzmann machines. The energy function is constructed to remain consistent with dynamic routing so that inference uses the existing routing procedure and sampling stays efficient without connections among hidden nodes. The gradient of the log-likelihood for the visible capsules is derived from this energy, and the resulting algorithm is applied to standard vision datasets where the model produces realistic images drawn from its learned distribution.
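To make the "magnitude as firing probability" reading concrete, here is a minimal NumPy sketch of the squashing nonlinearity from Sabour et al. (2017), which the paper builds on. The function names `squash` and `firing_probability` and the `eps` stabilizer are illustrative assumptions, not identifiers taken from the paper.

```python
import numpy as np


def squash(s, axis=-1, eps=1e-9):
    """Squashing nonlinearity (Sabour et al., 2017).

    Keeps the direction of s and maps its length into [0, 1).
    eps is an assumed numerical stabilizer for zero-length inputs.
    """
    sq_norm = np.sum(s * s, axis=axis, keepdims=True)
    scale = sq_norm / (1.0 + sq_norm)          # in [0, 1)
    return scale * s / np.sqrt(sq_norm + eps)  # unit direction times scale


def firing_probability(s):
    """Length of the squashed capsule, read as the probability it is 'on'."""
    return np.linalg.norm(squash(s), axis=-1)
```

For a pre-activation of length 5, such as `[3.0, 4.0]`, the squashed length is 25/26, or roughly 0.96; a zero vector maps to a firing probability of 0.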

Core claim

By constructing an energy function for capsule networks that matches the probability of capsule firing under dynamic routing, it becomes possible to optimize the log-likelihood of the visible layer and obtain a generative model that produces realistic images from its distribution after training on vision datasets.
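For orientation, the generic log-likelihood gradient of an energy-based model with hidden units can be written as follows. This is the standard identity used for RBMs and products of experts, stated with a generic energy $E(\mathbf{v},\mathbf{h};\theta)$; the paper's specific capsule energy is not reproduced here, and the notation is an assumption made for this sketch.

```latex
p(\mathbf{v}) = \frac{1}{Z}\sum_{\mathbf{h}} e^{-E(\mathbf{v},\mathbf{h};\theta)},
\qquad
Z = \sum_{\mathbf{v}',\mathbf{h}} e^{-E(\mathbf{v}',\mathbf{h};\theta)}

\frac{\partial \log p(\mathbf{v})}{\partial \theta}
= -\,\mathbb{E}_{p(\mathbf{h}\mid\mathbf{v})}\!\left[\frac{\partial E(\mathbf{v},\mathbf{h};\theta)}{\partial \theta}\right]
+ \mathbb{E}_{p(\mathbf{v}',\mathbf{h})}\!\left[\frac{\partial E(\mathbf{v}',\mathbf{h};\theta)}{\partial \theta}\right]
```

The first expectation is taken with the visible units clamped to the data; the second is taken under the model's own distribution, which is why an efficient sampling procedure matters for training.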

What carries the argument

An energy function for products of expert capsules, made consistent with the capsule-firing probabilities produced by dynamic routing, which supports efficient sampling without hidden-layer connections.

If this is right

Inference proceeds by applying the dynamic-routing-between-capsules procedure.
The gradient for log-likelihood optimization is obtained directly from the energy function.
The trained capsule network generates realistic-looking images drawn from the learned distribution.
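The routing procedure named in these points is routing-by-agreement from Sabour et al. (2017). A minimal NumPy sketch for a single example is given below; the array layout `(n_in, n_out, dim)`, the default of three iterations, and the function names are assumptions for illustration, and the sketch does not include the paper's energy function or training step.

```python
import numpy as np


def squash(s, eps=1e-9):
    """Map each capsule vector to length in [0, 1), keeping its direction."""
    sq = np.sum(s * s, axis=-1, keepdims=True)
    return (sq / (1.0 + sq)) * s / np.sqrt(sq + eps)


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)


def dynamic_routing(u_hat, num_iters=3):
    """Routing-by-agreement for one example.

    u_hat: prediction vectors of shape (n_in, n_out, dim), i.e. what each
           lower capsule predicts for each higher capsule.
    Returns the higher-capsule outputs v (n_out, dim) and the coupling
    coefficients c (n_in, n_out) that were used to compute v.
    """
    n_in, n_out, _ = u_hat.shape
    b = np.zeros((n_in, n_out))  # routing logits, start uniform
    for _ in range(num_iters):
        c = softmax(b, axis=1)                      # each input splits its vote over outputs
        s = np.sum(c[:, :, None] * u_hat, axis=0)   # weighted sum of predictions
        v = squash(s)                               # higher-capsule outputs
        b = b + np.sum(u_hat * v[None, :, :], axis=-1)  # reward agreement
    return v, c
```

The lengths of the rows of `v` are the quantities the paper interprets as firing probabilities, and each row of `c` sums to one because every lower capsule distributes its output across the higher capsules.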

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Dynamic routing functions simultaneously as the inference method and the mechanism for mixing the product of expert capsules.
The same energy-based construction may extend the use of capsule networks from classification to density estimation tasks.

Load-bearing premise

The energy function can be made consistent with dynamic routing in the sense of the probability of a capsule firing while still allowing an efficient sampling procedure where hidden layer nodes are not connected.
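What "hidden layer nodes are not connected" buys in practice is that each hidden capsule can be sampled on its own. The sketch below assumes, purely for illustration, that the on/off state of each capsule is an independent Bernoulli draw with probability equal to its squashed length; the function name `sample_capsule_states` is hypothetical and the paper's actual sampler may differ.

```python
import numpy as np


def sample_capsule_states(v, rng):
    """Sample independent on/off states for hidden capsules.

    v:   squashed capsule outputs, shape (n_caps, dim), lengths in [0, 1).
    rng: a numpy.random.Generator.
    Assumption: states are conditionally independent given the visible
    layer, so one vectorized Bernoulli draw samples the whole layer.
    """
    p_on = np.linalg.norm(v, axis=-1)         # firing probabilities
    states = rng.random(p_on.shape) < p_on    # independent Bernoulli draws
    return states.astype(np.int8), p_on
```

If the energy introduced couplings among hidden capsules, this single vectorized draw would no longer be a valid sample from the conditional, and some iterative scheme would be needed instead.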

What would settle it

Generating images from the trained model and verifying whether their distribution matches the statistics or visual appearance of the original training dataset on standard vision benchmarks.

Figures

Figures reproduced from arXiv: 1907.11643 by Michael Hauser.

**Figure 2.** Images created by the unsupervised, routing-weighted product of expert …

Original abstract

This study develops an unsupervised learning algorithm for products of expert capsules with dynamic routing. Analogous to binary-valued neurons in Restricted Boltzmann Machines, the magnitude of a squashed capsule firing takes values between zero and one, representing the probability of the capsule being on. This analogy motivates the design of an energy function for capsule networks. In order to have an efficient sampling procedure where hidden layer nodes are not connected, the energy function is made consistent with dynamic routing in the sense of the probability of a capsule firing, and inference on the capsule network is computed with the dynamic routing between capsules procedure. In order to optimize the log-likelihood of the visible layer capsules, the gradient is found in terms of this energy function. The developed unsupervised learning algorithm is used to train a capsule network on standard vision datasets, and is able to generate realistic looking images from its learned distribution.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper sketches an energy-based unsupervised trainer for capsule networks that ties firing probabilities to dynamic routing, but the central sampling-efficiency claim is unsupported by any shown derivation and the results stay purely qualitative.

The main takeaway is that this work tries to give capsule networks an RBM-style generative training path. It defines an energy function whose marginals on capsule magnitudes are made to match the probabilities coming out of dynamic routing, then uses that to get a gradient for the visible-layer log-likelihood. Training on standard vision sets produces images described as realistic. That combination of products of experts, dynamic routing, and an energy function is the new piece; prior capsule papers stayed supervised or used different generative tricks. The motivation from squashed capsule outputs to [0,1] probabilities is straightforward and the decision to keep hidden nodes unconnected for sampling is clearly stated as a design goal.

The soft spot is exactly where the stress-test note points: dynamic routing builds coupling coefficients through iterative agreement across layers, so any energy whose gradients recover those same marginals will normally add cross terms among the hidden capsules. The abstract asserts that consistency was achieved while preserving the factorized conditional, but supplies no explicit construction, cancellation step, or gradient derivation to show how the cross terms are avoided. Without that step the efficient sampling procedure does not follow. The experiments add nothing quantitative (no likelihood numbers, no comparisons, no checks on whether the consistency actually held), so it is impossible to tell whether the method works as claimed.

This is for the narrow set of people already working on generative extensions of capsules. A reader outside that subfield gets little. The missing derivation on the energy function is load-bearing and the evidence is too thin, so I would not send it to referees.

Referee Report

1 major / 2 minor

Summary. The paper develops an unsupervised learning algorithm for products of expert capsules with dynamic routing. It designs an energy function analogous to RBMs in which the magnitude of a squashed capsule represents its firing probability (between 0 and 1), makes this energy consistent with dynamic routing so that inference uses the standard routing procedure, derives the gradient of the visible-layer log-likelihood with respect to the energy, and applies the resulting algorithm to train capsule networks on standard vision datasets, claiming the model can generate realistic images from the learned distribution.

Significance. If the central consistency construction holds without hidden dependencies, the work would supply a principled energy-based training procedure for capsule networks that preserves the dynamic-routing inference mechanism and enables sampling, thereby linking capsule architectures to products-of-experts models and offering a route to unsupervised generative modeling with explicit part-whole structure.

major comments (1)

[Abstract (energy-function consistency step) and any derivation section] The central technical claim—that an energy function can be constructed whose marginals on capsule firing probabilities exactly reproduce dynamic routing outputs while the conditional distributions over hidden capsules remain factorized (no intra-hidden connections)—is load-bearing for both the efficient sampling procedure and the gradient derivation. The manuscript must exhibit the explicit cancellation or avoidance of cross terms that would otherwise arise from the iterative agreement computation in dynamic routing; absent this construction (presumably in the energy definition or the log-likelihood gradient), the sampling tractability asserted in the abstract does not follow.
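For reference, the factorization at issue is the one that holds in a standard binary RBM. The equations below are the textbook RBM, not the paper's capsule energy; the symbols $\mathbf{a}$, $\mathbf{b}$, $W$ are the usual RBM biases and weights and are assumptions of this sketch rather than notation from the manuscript.

```latex
E(\mathbf{v},\mathbf{h}) = -\mathbf{a}^\top\mathbf{v} - \mathbf{b}^\top\mathbf{h} - \mathbf{v}^\top W\,\mathbf{h}

p(\mathbf{h}\mid\mathbf{v}) = \prod_j p(h_j\mid\mathbf{v}),
\qquad
p(h_j = 1\mid\mathbf{v}) = \sigma\!\Big(b_j + \sum_i v_i W_{ij}\Big)
```

The product form holds because the energy contains no term coupling two hidden units. Adding a hidden-hidden term such as $-\mathbf{h}^\top U\,\mathbf{h}$ would break it, which is the kind of cross term the comment asks the authors to rule out for the routing-consistent energy.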

minor comments (2)

[Abstract] The abstract states that realistic images are generated but supplies neither quantitative metrics (e.g., FID, reconstruction error) nor qualitative figure references; these should be added to allow verification of the generative claim.
[Abstract / Methods] Notation for the energy function, the squashing operation, and the consistency condition should be introduced with explicit equations rather than prose descriptions alone.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful review and for identifying the need for greater clarity on the energy-function construction. We address the major comment below and will revise accordingly.

Referee: The central technical claim—that an energy function can be constructed whose marginals on capsule firing probabilities exactly reproduce dynamic routing outputs while the conditional distributions over hidden capsules remain factorized (no intra-hidden connections)—is load-bearing for both the efficient sampling procedure and the gradient derivation. The manuscript must exhibit the explicit cancellation or avoidance of cross terms that would otherwise arise from the iterative agreement computation in dynamic routing; absent this construction (presumably in the energy definition or the log-likelihood gradient), the sampling tractability asserted in the abstract does not follow.

Authors: We agree that an explicit demonstration of how cross terms cancel (or are avoided) is required to substantiate the claim of tractable sampling and the gradient derivation. The current manuscript states that the energy is made consistent with dynamic routing via the squashed capsule magnitudes but does not walk through the cancellation algebraically. In the revision we will add a dedicated subsection (in the energy definition and gradient sections) that (i) defines the energy as a sum of per-capsule expert terms whose potentials depend only on the visible and routed hidden activations, (ii) shows that the fixed-point of the routing iteration supplies the marginal firing probabilities without introducing additional intra-hidden edges, and (iii) derives the resulting log-likelihood gradient under the product-of-experts factorization. This will make the absence of cross terms explicit.

Revision: yes

Circularity Check

1 step flagged

Energy function is defined to match dynamic routing firing probabilities by construction

specific steps

self-definitional [Abstract]
"In order to have an efficient sampling procedure where hidden layer nodes are not connected, the energy function is made consistent with dynamic routing in the sense of the probability of a capsule firing, and inference on the capsule network is computed with the dynamic routing between capsules procedure."

The energy function is explicitly engineered so its marginals on firing probabilities equal those produced by dynamic routing; the sampling tractability then follows from this enforced match rather than from an independent derivation that would naturally produce both the probability agreement and the factorization over hidden nodes.

full rationale

The paper's unsupervised learning procedure rests on constructing an energy function whose capsule firing probabilities are forced to reproduce those of dynamic routing, while separately asserting that hidden-layer nodes remain unconnected for tractable sampling. This consistency step is presented as a design choice that enables the rest of the algorithm (gradient of log-likelihood, image generation), but the match is imposed rather than independently derived. No external verification or cancellation of cross terms is exhibited in the provided text, so the central claim reduces to the definitional adjustment. This matches the self-definitional pattern and justifies a moderate circularity score; the remainder of the derivation (training on vision datasets) inherits the construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no free parameters, axioms, or invented entities can be extracted or audited from the full manuscript.

pith-pipeline@v0.9.0 · 5667 in / 1103 out tokens · 27401 ms · 2026-05-24T15:38:46.649889+00:00 · methodology

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages · 1 internal anchor

[1] Geoffrey E Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800, 2002.

[2] Geoffrey E Hinton, Alex Krizhevsky, and Sida D Wang. Transforming auto-encoders. In International Conference on Artificial Neural Networks, pages 44–51. Springer, 2011.

[3] Sara Sabour, Nicholas Frosst, and Geoffrey E Hinton. Dynamic routing between capsules. In Advances in Neural Information Processing Systems, pages 3856–3866, 2017.

[4] Warren S McCulloch and Walter Pitts. A logical calculus of the ideas immanent in nervous activity. The Bulletin of Mathematical Biophysics, 5(4):115–133, 1943.

[5] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning internal representations by error propagation. Technical report, California Univ San Diego La Jolla Inst for Cognitive Science, 1985.

[6] Geoffrey E Hinton, Sara Sabour, and Nicholas Frosst. Matrix capsules with EM routing. 2018.

[7] Ayush Jaiswal, Wael AbdAlmageed, Yue Wu, and Premkumar Natarajan. CapsuleGAN: Generative adversarial capsule network. In Proceedings of the European Conference on Computer Vision (ECCV), pages 0–0, 2018.

[8] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

[9] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.

[10] Max Welling, Michal Rosen-Zvi, and Geoffrey E Hinton. Exponential family harmoniums with an application to information retrieval. In Advances in Neural Information Processing Systems, pages 1481–1488, 2005.

[11] Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pages 818–833. Springer, 2014.

[12] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.

[13] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283, 2016.
