
arXiv: 1907.11643 · v1 · pith:IHT7SOTOnew · submitted 2019-07-26 · 💻 cs.LG · cs.NE · stat.ML

Training products of expert capsules with mixing by dynamic routing

Pith reviewed 2026-05-24 15:38 UTC · model grok-4.3

classification 💻 cs.LG · cs.NE · stat.ML
keywords capsule networks · dynamic routing · unsupervised learning · products of experts · energy function · generative models · image generation

The pith

An energy function aligned with dynamic routing enables unsupervised training of capsule networks to generate realistic images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops an unsupervised learning algorithm for products of expert capsules. It defines an energy function where the magnitude of a squashed capsule represents the probability of that capsule firing, analogous to binary neurons in restricted Boltzmann machines. The energy function is constructed to remain consistent with dynamic routing so that inference uses the existing routing procedure and sampling stays efficient without connections among hidden nodes. The gradient of the log-likelihood for the visible capsules is derived from this energy, and the resulting algorithm is applied to standard vision datasets where the model produces realistic images drawn from its learned distribution.
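
As a minimal sketch of the firing-probability reading described above, the snippet below uses the squashing nonlinearity from the dynamic routing paper of Sabour et al. (2017). The function names `squash` and `firing_probability` and the small `eps` guard are assumptions for illustration; the summary does not give the paper's own notation.

```python
import math

def squash(s, eps=1e-12):
    """Shrink vector s so its length lies in [0, 1) while keeping its direction.

    Follows the squashing form of Sabour et al. (2017); eps is an assumed
    guard against division by zero for an all-zero input.
    """
    sq_norm = sum(x * x for x in s)
    norm = math.sqrt(sq_norm)
    scale = sq_norm / (1.0 + sq_norm) / (norm + eps)
    return [scale * x for x in s]

def firing_probability(s):
    """Length of the squashed capsule, read as the probability that it is on."""
    v = squash(s)
    return math.sqrt(sum(x * x for x in v))

if __name__ == "__main__":
    # A long pre-activation vector gives a probability close to 1,
    # a short one gives a probability close to 0.
    print(firing_probability([3.0, 4.0]))
    print(firing_probability([0.01, 0.0]))
```

Because the squashed length never reaches 1 and is 0 only for a zero input, it can play the role that the sigmoid output plays for a binary unit in a restricted Boltzmann machine.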

Core claim

By constructing an energy function for capsule networks that matches the probability of capsule firing under dynamic routing, it becomes possible to optimize the log-likelihood of the visible layer and obtain a generative model that produces realistic images from its distribution after training on vision datasets.
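
For orientation, the generic log-likelihood gradient of an energy-based model with hidden variables is sketched below. This is the standard identity for any model of the form $p(v) \propto \sum_h e^{-E(v,h)}$, not the paper's specific capsule energy, which this page does not reproduce. The symbols $v$ (visible capsules), $h$ (hidden capsules), $\theta$ (parameters), and $Z$ (partition function) are assumed notation; for continuous capsule states the sums become integrals.

```latex
p(v) = \frac{1}{Z} \sum_{h} e^{-E(v,h)}, \qquad
Z = \sum_{v'} \sum_{h} e^{-E(v',h)}

\frac{\partial \log p(v)}{\partial \theta}
  = -\,\mathbb{E}_{p(h \mid v)}\!\left[ \frac{\partial E(v,h)}{\partial \theta} \right]
    + \mathbb{E}_{p(v',h)}\!\left[ \frac{\partial E(v',h)}{\partial \theta} \right]
```

The first expectation is taken with the data clamped on the visible layer, the second under the model's own distribution, which is why an efficient sampling procedure matters for training.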

What carries the argument

The argument rests on an energy function for products of expert capsules that is consistent with the capsule firing probabilities produced by dynamic routing, which permits efficient sampling without connections among hidden-layer nodes.

If this is right

  • Inference proceeds by applying the procedure from "Dynamic routing between capsules".
  • The gradient for log-likelihood optimization is obtained directly from the energy function.
  • The trained capsule network generates realistic-looking images drawn from the learned distribution.
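
The routing procedure named in the first bullet can be sketched as follows. This is a plain-Python rendering of routing by agreement as published by Sabour et al. (2017), not the paper's own code. The layout `u_hat[i][j]` (prediction vector from lower capsule `i` for upper capsule `j`), the default of three iterations, and all function names are assumptions for illustration.

```python
import math

def squash(s, eps=1e-12):
    """Squashing nonlinearity: output length lies in [0, 1)."""
    sq_norm = sum(x * x for x in s)
    scale = sq_norm / (1.0 + sq_norm) / (math.sqrt(sq_norm) + eps)
    return [scale * x for x in s]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(b - m) for b in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dynamic_routing(u_hat, num_iters=3):
    """Route lower-capsule predictions u_hat[i][j] to upper capsules.

    Returns the upper capsule vectors v and the coupling coefficients c
    that were used to compute them in the final iteration.
    """
    n_in, n_out, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    b = [[0.0] * n_out for _ in range(n_in)]  # routing logits start at zero
    for _ in range(num_iters):
        # Each lower capsule distributes its output over the upper capsules.
        c = [softmax(row) for row in b]
        v = []
        for j in range(n_out):
            s_j = [sum(c[i][j] * u_hat[i][j][d] for i in range(n_in))
                   for d in range(dim)]
            v.append(squash(s_j))
        # Raise the logit where a prediction agrees with the capsule output.
        for i in range(n_in):
            for j in range(n_out):
                b[i][j] += sum(u_hat[i][j][d] * v[j][d] for d in range(dim))
    return v, c

if __name__ == "__main__":
    # Two lower capsules agree about upper capsule 0 and disagree about 1.
    u_hat = [[[1.0, 0.0], [1.0, 0.0]],
             [[1.0, 0.0], [-1.0, 0.0]]]
    v, c = dynamic_routing(u_hat)
    print(v, c)
```

In the toy input, the coupling coefficients shift toward the upper capsule on which the lower capsules agree, and that capsule ends up with the larger length.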

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Dynamic routing functions simultaneously as the inference method and the mechanism for mixing the product of expert capsules.
  • The same energy-based construction may extend the use of capsule networks from classification to density estimation tasks.

Load-bearing premise

The energy function can be made consistent with dynamic routing in the sense of the probability of a capsule firing while still allowing an efficient sampling procedure where hidden layer nodes are not connected.
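
To illustrate what "hidden layer nodes are not connected" buys, the sketch below draws the on/off state of every hidden capsule independently in one pass. It assumes, by analogy with restricted Boltzmann machines, that each capsule's state is a Bernoulli draw with probability equal to its squashed length; the paper's actual sampling procedure may differ, and `sample_hidden_states` is a hypothetical name.

```python
import random

def sample_hidden_states(firing_probs, rng=None):
    """Draw independent on/off states, one per hidden capsule.

    firing_probs: squashed capsule lengths, each in [0, 1).
    Assumption: with no hidden-to-hidden connections, the conditional
    over hidden capsules factorizes, so each draw ignores the others.
    """
    rng = rng or random.Random()
    return [1 if rng.random() < p else 0 for p in firing_probs]

if __name__ == "__main__":
    rng = random.Random(0)
    print(sample_hidden_states([0.9, 0.1, 0.5], rng))
```

If the conditional did not factorize, each hidden capsule would have to be resampled conditioned on the others, and a single pass like this would no longer draw from the correct distribution.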

What would settle it

Generating images from the trained model and verifying whether their distribution matches the statistics or visual appearance of the original training dataset on standard vision benchmarks.

Figures

Figures reproduced from arXiv: 1907.11643 by Michael Hauser.

Figure 1: The network architecture used for these experiments. First a convolu…
Figure 2: Images created by the unsupervised, routing-weighted product of expert…

Original abstract

This study develops an unsupervised learning algorithm for products of expert capsules with dynamic routing. Analogous to binary-valued neurons in Restricted Boltzmann Machines, the magnitude of a squashed capsule firing takes values between zero and one, representing the probability of the capsule being on. This analogy motivates the design of an energy function for capsule networks. In order to have an efficient sampling procedure where hidden layer nodes are not connected, the energy function is made consistent with dynamic routing in the sense of the probability of a capsule firing, and inference on the capsule network is computed with the dynamic routing between capsules procedure. In order to optimize the log-likelihood of the visible layer capsules, the gradient is found in terms of this energy function. The developed unsupervised learning algorithm is used to train a capsule network on standard vision datasets, and is able to generate realistic looking images from its learned distribution.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper develops an unsupervised learning algorithm for products of expert capsules with dynamic routing. It designs an energy function, by analogy with RBMs, in which the magnitude of a squashed capsule represents its firing probability (between 0 and 1), and makes this energy consistent with dynamic routing so that inference uses the standard routing procedure. It then derives the gradient of the visible-layer log-likelihood in terms of the energy and applies the resulting algorithm to train capsule networks on standard vision datasets, claiming that the model can generate realistic images from the learned distribution.

Significance. If the central consistency construction holds without hidden dependencies, the work would supply a principled energy-based training procedure for capsule networks that preserves the dynamic-routing inference mechanism and enables sampling, thereby linking capsule architectures to products-of-experts models and offering a route to unsupervised generative modeling with explicit part-whole structure.

major comments (1)
  1. [Abstract (energy-function consistency step) and any derivation section] The central technical claim—that an energy function can be constructed whose marginals on capsule firing probabilities exactly reproduce dynamic routing outputs while the conditional distributions over hidden capsules remain factorized (no intra-hidden connections)—is load-bearing for both the efficient sampling procedure and the gradient derivation. The manuscript must exhibit the explicit cancellation or avoidance of cross terms that would otherwise arise from the iterative agreement computation in dynamic routing; absent this construction (presumably in the energy definition or the log-likelihood gradient), the sampling tractability asserted in the abstract does not follow.
minor comments (2)
  1. [Abstract] The abstract states that realistic images are generated but supplies neither quantitative metrics (e.g., FID, reconstruction error) nor qualitative figure references; these should be added to allow verification of the generative claim.
  2. [Abstract / Methods] Notation for the energy function, the squashing operation, and the consistency condition should be introduced with explicit equations rather than prose descriptions alone.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful review and for identifying the need for greater clarity on the energy-function construction. We address the major comment below and will revise accordingly.

Point-by-point responses
  1. Referee: The central technical claim—that an energy function can be constructed whose marginals on capsule firing probabilities exactly reproduce dynamic routing outputs while the conditional distributions over hidden capsules remain factorized (no intra-hidden connections)—is load-bearing for both the efficient sampling procedure and the gradient derivation. The manuscript must exhibit the explicit cancellation or avoidance of cross terms that would otherwise arise from the iterative agreement computation in dynamic routing; absent this construction (presumably in the energy definition or the log-likelihood gradient), the sampling tractability asserted in the abstract does not follow.

    Authors: We agree that an explicit demonstration of how cross terms cancel (or are avoided) is required to substantiate the claim of tractable sampling and the gradient derivation. The current manuscript states that the energy is made consistent with dynamic routing via the squashed capsule magnitudes but does not walk through the cancellation algebraically. In the revision we will add a dedicated subsection (in the energy definition and gradient sections) that (i) defines the energy as a sum of per-capsule expert terms whose potentials depend only on the visible and routed hidden activations, (ii) shows that the fixed-point of the routing iteration supplies the marginal firing probabilities without introducing additional intra-hidden edges, and (iii) derives the resulting log-likelihood gradient under the product-of-experts factorization. This will make the absence of cross terms explicit. revision: yes

Circularity Check

1 step flagged

Energy function is defined to match dynamic routing firing probabilities by construction

specific steps
  1. self-definitional [Abstract]
    "In order to have an efficient sampling procedure where hidden layer nodes are not connected, the energy function is made consistent with dynamic routing in the sense of the probability of a capsule firing, and inference on the capsule network is computed with the dynamic routing between capsules procedure."

    The energy function is explicitly engineered so its marginals on firing probabilities equal those produced by dynamic routing; the sampling tractability then follows from this enforced match rather than from an independent derivation that would naturally produce both the probability agreement and the factorization over hidden nodes.

full rationale

The paper's unsupervised learning procedure rests on constructing an energy function whose capsule firing probabilities are forced to reproduce those of dynamic routing, while separately asserting that hidden-layer nodes remain unconnected for tractable sampling. This consistency step is presented as a design choice that enables the rest of the algorithm (gradient of log-likelihood, image generation), but the match is imposed rather than independently derived. No external verification or cancellation of cross terms is exhibited in the provided text, so the central claim reduces to the definitional adjustment. This matches the self-definitional pattern and justifies a moderate circularity score; the remainder of the derivation (training on vision datasets) inherits the construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available, so no free parameters, axioms, or invented entities can be extracted or audited.


