Training products of expert capsules with mixing by dynamic routing
Pith reviewed 2026-05-24 15:38 UTC · model grok-4.3
The pith
An energy function aligned with dynamic routing enables unsupervised training of capsule networks to generate realistic images.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By constructing an energy function for capsule networks that matches the probability of capsule firing under dynamic routing, it becomes possible to optimize the log-likelihood of the visible layer and obtain a generative model that produces realistic images from its distribution after training on vision datasets.
What carries the argument
Energy function for products of expert capsules made consistent with dynamic routing probabilities of capsule firing, supporting efficient sampling without hidden-layer connections.
If this is right
- Inference proceeds by applying the dynamic routing between capsules procedure.
- The gradient for log-likelihood optimization is obtained directly from the energy function.
- The trained capsule network generates realistic-looking images drawn from the learned distribution.
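For readers who want the inference step in concrete form, the following is a minimal NumPy sketch of routing-by-agreement as described by Sabour et al. (2017). It is not the paper's code: the shapes, the names `u_hat` and `dynamic_routing`, and the choice of three iterations are assumptions made for illustration. The norm of each output capsule is the quantity the paper reads as a firing probability.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-9):
    """Keep the direction of s and map its norm into [0, 1)."""
    sq_norm = np.sum(s * s, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def dynamic_routing(u_hat, num_iters=3):
    """Route prediction vectors u_hat of shape (n_in, n_out, dim).

    Returns the output capsules v (n_out, dim) and the coupling
    coefficients c (n_in, n_out) used to compute them.
    """
    n_in, n_out, _ = u_hat.shape
    b = np.zeros((n_in, n_out))  # routing logits, start uniform
    for _ in range(num_iters):
        c = softmax(b, axis=1)  # each input spreads its vote over outputs
        s = np.sum(c[:, :, None] * u_hat, axis=0)  # weighted sum per output
        v = squash(s)
        b = b + np.sum(u_hat * v[None, :, :], axis=-1)  # agreement update
    return v, c

# Hypothetical toy input: 6 lower capsules, 3 upper capsules, 4-dim poses.
rng = np.random.default_rng(0)
u_hat = rng.normal(size=(6, 3, 4))
v, c = dynamic_routing(u_hat)
firing_prob = np.linalg.norm(v, axis=-1)  # one value in [0, 1) per capsule
```

Note that the agreement update couples every output capsule to every other through the softmax over `b`, which is why the referee report below asks how the hidden conditionals stay factorized.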
Where Pith is reading between the lines
- Dynamic routing functions simultaneously as the inference method and the mechanism for mixing the product of expert capsules.
- The same energy-based construction may extend the use of capsule networks from classification to density estimation tasks.
Load-bearing premise
The energy function can be made consistent with dynamic routing in the sense of the probability of a capsule firing while still allowing an efficient sampling procedure where hidden layer nodes are not connected.
What would settle it
Generating images from the trained model and verifying whether their distribution matches the statistics or visual appearance of the original training dataset on standard vision benchmarks.
Original abstract
This study develops an unsupervised learning algorithm for products of expert capsules with dynamic routing. Analogous to binary-valued neurons in Restricted Boltzmann Machines, the magnitude of a squashed capsule firing takes values between zero and one, representing the probability of the capsule being on. This analogy motivates the design of an energy function for capsule networks. In order to have an efficient sampling procedure where hidden layer nodes are not connected, the energy function is made consistent with dynamic routing in the sense of the probability of a capsule firing, and inference on the capsule network is computed with the dynamic routing between capsules procedure. In order to optimize the log-likelihood of the visible layer capsules, the gradient is found in terms of this energy function. The developed unsupervised learning algorithm is used to train a capsule network on standard vision datasets, and is able to generate realistic looking images from its learned distribution.
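The analogy the abstract draws can be written out explicitly. The squashing function below is the one from Sabour et al. (2017), and the RBM conditional is the standard one; the symbols $s_j$, $v_j$, $W_{ij}$ and $c_j$ are introduced here for illustration, and the paper's own notation and energy function are not reproduced in this summary.

```latex
% Squashing nonlinearity applied to the total input s_j of capsule j
v_j = \frac{\lVert s_j \rVert^2}{1 + \lVert s_j \rVert^2}\,
      \frac{s_j}{\lVert s_j \rVert},
\qquad
\lVert v_j \rVert = \frac{\lVert s_j \rVert^2}{1 + \lVert s_j \rVert^2} \in [0, 1)

% Binary RBM analogue: hidden unit h_j is on with probability
p(h_j = 1 \mid x) = \sigma\Big(c_j + \sum_i x_i W_{ij}\Big),
\qquad
\sigma(z) = \frac{1}{1 + e^{-z}}
```

In both cases a quantity in the unit interval is read as the probability that a unit is on, which is the correspondence the energy function is built around.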
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper develops an unsupervised learning algorithm for products of expert capsules with dynamic routing. It designs an energy function analogous to RBMs in which the magnitude of a squashed capsule represents its firing probability (between 0 and 1), makes this energy consistent with dynamic routing so that inference uses the standard routing procedure, derives the gradient of the visible-layer log-likelihood with respect to the energy, and applies the resulting algorithm to train capsule networks on standard vision datasets, claiming the model can generate realistic images from the learned distribution.
Significance. If the central consistency construction holds without hidden dependencies, the work would supply a principled energy-based training procedure for capsule networks that preserves the dynamic-routing inference mechanism and enables sampling, thereby linking capsule architectures to products-of-experts models and offering a route to unsupervised generative modeling with explicit part-whole structure.
major comments (1)
- [Abstract (energy-function consistency step) and any derivation section] The central technical claim—that an energy function can be constructed whose marginals on capsule firing probabilities exactly reproduce dynamic routing outputs while the conditional distributions over hidden capsules remain factorized (no intra-hidden connections)—is load-bearing for both the efficient sampling procedure and the gradient derivation. The manuscript must exhibit the explicit cancellation or avoidance of cross terms that would otherwise arise from the iterative agreement computation in dynamic routing; absent this construction (presumably in the energy definition or the log-likelihood gradient), the sampling tractability asserted in the abstract does not follow.
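To make the factorization property in this comment concrete, here is a sketch of the ordinary binary RBM case, where it holds trivially: with energy E(v, h) = -b·v - c·h - vᵀWh there are no hidden-to-hidden terms, so all hidden units can be sampled in one vectorized step. This is the textbook RBM with one step of contrastive divergence (Hinton, 2002), not the paper's capsule energy, and the text reviewed here does not say whether the paper uses contrastive divergence. All names are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_hidden(v, W, c, rng):
    """p(h | v) factorizes over hidden units: one sigmoid per unit."""
    p_h = sigmoid(v @ W + c)
    return p_h, (rng.random(p_h.shape) < p_h).astype(float)

def sample_visible(h, W, b, rng):
    """p(v | h) factorizes over visible units in the same way."""
    p_v = sigmoid(h @ W.T + b)
    return p_v, (rng.random(p_v.shape) < p_v).astype(float)

def cd1_update(v0, W, b, c, rng, lr=0.1):
    """One contrastive-divergence (CD-1) step on a batch v0."""
    ph0, h0 = sample_hidden(v0, W, c, rng)   # positive phase
    _, v1 = sample_visible(h0, W, b, rng)    # one block-Gibbs step
    ph1, _ = sample_hidden(v1, W, c, rng)    # negative phase
    n = v0.shape[0]
    W = W + lr * (v0.T @ ph0 - v1.T @ ph1) / n
    b = b + lr * np.mean(v0 - v1, axis=0)
    c = c + lr * np.mean(ph0 - ph1, axis=0)
    return W, b, c

# Hypothetical toy problem: 6 visible units, 4 hidden units, batch of 8.
rng = np.random.default_rng(0)
n_vis, n_hid = 6, 4
W = 0.01 * rng.normal(size=(n_vis, n_hid))
b = np.zeros(n_vis)
c = np.zeros(n_hid)
v0 = (rng.random((8, n_vis)) < 0.5).astype(float)
W_new, b_new, c_new = cd1_update(v0, W, b, c, rng)
```

The open question raised in the comment is whether an analogous single-step `sample_hidden` exists once the hidden activations come from an iterative routing procedure rather than a single affine map.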
minor comments (2)
- [Abstract] The abstract states that realistic images are generated but supplies neither quantitative metrics (e.g., FID, reconstruction error) nor qualitative figure references; these should be added to allow verification of the generative claim.
- [Abstract / Methods] Notation for the energy function, the squashing operation, and the consistency condition should be introduced with explicit equations rather than prose descriptions alone.
Simulated Author's Rebuttal
We thank the referee for their careful review and for identifying the need for greater clarity on the energy-function construction. We address the major comment below and will revise accordingly.
Referee: The central technical claim—that an energy function can be constructed whose marginals on capsule firing probabilities exactly reproduce dynamic routing outputs while the conditional distributions over hidden capsules remain factorized (no intra-hidden connections)—is load-bearing for both the efficient sampling procedure and the gradient derivation. The manuscript must exhibit the explicit cancellation or avoidance of cross terms that would otherwise arise from the iterative agreement computation in dynamic routing; absent this construction (presumably in the energy definition or the log-likelihood gradient), the sampling tractability asserted in the abstract does not follow.
Authors: We agree that an explicit demonstration of how cross terms cancel (or are avoided) is required to substantiate the claim of tractable sampling and the gradient derivation. The current manuscript states that the energy is made consistent with dynamic routing via the squashed capsule magnitudes but does not walk through the cancellation algebraically. In the revision we will add a dedicated subsection (in the energy definition and gradient sections) that (i) defines the energy as a sum of per-capsule expert terms whose potentials depend only on the visible and routed hidden activations, (ii) shows that the fixed point of the routing iteration supplies the marginal firing probabilities without introducing additional intra-hidden edges, and (iii) derives the resulting log-likelihood gradient under the product-of-experts factorization. This will make the absence of cross terms explicit.
Revision: yes
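For orientation, the generic form that any such log-likelihood gradient takes in an energy-based model with hidden variables is shown below. This is the standard identity, written with sums over discrete states (integrals replace the sums for continuous capsule values); the symbols $x$, $h$, $\theta$ and $Z$ are assumptions of this sketch, and the paper's specific energy is not reproduced here.

```latex
p(x) = \frac{1}{Z} \sum_{h} e^{-E(x, h)},
\qquad
Z = \sum_{x', h} e^{-E(x', h)}

\frac{\partial \log p(x)}{\partial \theta}
  = -\,\mathbb{E}_{p(h \mid x)}\!\left[\frac{\partial E(x, h)}{\partial \theta}\right]
    + \mathbb{E}_{p(x', h)}\!\left[\frac{\partial E(x', h)}{\partial \theta}\right]
```

The first expectation requires sampling or marginalizing the hidden capsules given the data, which is where a factorized conditional matters; the second requires samples from the model's joint distribution, which is where the efficient sampling procedure matters.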
Circularity Check
Energy function is defined to match dynamic routing firing probabilities by construction
specific steps
- Self-definitional [Abstract]
"In order to have an efficient sampling procedure where hidden layer nodes are not connected, the energy function is made consistent with dynamic routing in the sense of the probability of a capsule firing, and inference on the capsule network is computed with the dynamic routing between capsules procedure."
The energy function is explicitly engineered so its marginals on firing probabilities equal those produced by dynamic routing; the sampling tractability then follows from this enforced match rather than from an independent derivation that would naturally produce both the probability agreement and the factorization over hidden nodes.
full rationale
The paper's unsupervised learning procedure rests on constructing an energy function whose capsule firing probabilities are forced to reproduce those of dynamic routing, while separately asserting that hidden-layer nodes remain unconnected for tractable sampling. This consistency step is presented as a design choice that enables the rest of the algorithm (gradient of log-likelihood, image generation), but the match is imposed rather than independently derived. No external verification or cancellation of cross terms is exhibited in the provided text, so the central claim reduces to the definitional adjustment. This matches the self-definitional pattern and justifies a moderate circularity score; the remainder of the derivation (training on vision datasets) inherits the construction.