Scalable Equilibrium Propagation via Intermediate Error Signals for Deep Convolutional CRNNs

Abhronil Sengupta; Jiaqi Lin; Malyaban Bal

arxiv: 2508.15989 · v2 · submitted 2025-08-21 · 💻 cs.LG · cs.ET

Pith reviewed 2026-05-18 21:18 UTC · model grok-4.3

keywords: Equilibrium Propagation, Knowledge Distillation, Local Learning Rules, Deep Convolutional Networks, Vanishing Gradients, CRNN Training, CIFAR-10, CIFAR-100

The pith

Layer-wise auxiliary supervision signals allow Equilibrium Propagation to train deep convolutional CRNNs by fixing vanishing gradient issues.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Equilibrium Propagation is a local learning rule for recurrent networks that updates synapses using only two phases of neuron states instead of full backpropagation. Prior versions worked only for shallow networks because gradients disappeared in deeper layers and prevented reliable convergence. This paper adds intermediate error signals at each layer by combining knowledge distillation with local supervision to guide the dynamics without breaking the locality of the updates. The result is that much deeper convolutional architectures become trainable and reach strong performance on standard image benchmarks. If the approach holds, local rules could become practical for deeper models where they previously could not.
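As a minimal sketch of the two-phase idea described above, the following toy example runs Equilibrium Propagation on a single neuron. The quadratic energy, the squared-error cost, and the names `relax`, `ep_gradient`, `beta`, and `dt` are illustrative assumptions and are not taken from the paper, which works with convolutional CRNNs. The sketch uses the energy-based form of EP, in which the weight gradient is estimated from the difference between a nudged and a free equilibrium.

```python
# Toy Equilibrium Propagation on a single neuron (illustrative assumptions throughout).
# Energy:  E(s) = 0.5*s**2 - w*x*s     (free dynamics settle at s = w*x)
# Cost:    C(s) = 0.5*(s - y)**2       (active only in the nudged phase, scaled by beta)

def relax(w, x, y, beta, steps=200, dt=0.2):
    """Gradient-descent neuron dynamics on F = E + beta*C; returns the settled state."""
    s = 0.0
    for _ in range(steps):
        dF_ds = (s - w * x) + beta * (s - y)
        s -= dt * dF_ds
    return s


def ep_gradient(w, x, y, beta=1e-3):
    """Two-phase estimate of dL/dw that uses only the two settled states."""
    s_free = relax(w, x, y, beta=0.0)       # phase 1: no nudging
    s_nudged = relax(w, x, y, beta=beta)    # phase 2: output weakly nudged toward y

    def dE_dw(s):
        # Local quantity: presynaptic input times postsynaptic state.
        return -x * s

    return (dE_dw(s_nudged) - dE_dw(s_free)) / beta


w, x, y = 0.5, 2.0, 3.0
estimate = ep_gradient(w, x, y)
exact = (w * x - y) * x                     # gradient of 0.5*(w*x - y)**2 w.r.t. w
print(estimate, exact)
```

For this toy problem the estimate equals the exact gradient scaled by 1/(1 + beta), so it approaches the exact value as beta shrinks.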

Core claim

The authors establish that integrating knowledge distillation and local error signals into Equilibrium Propagation supplies auxiliary supervision at intermediate layers. These signals improve the convergence of neuron dynamics in deep networks while preserving the strictly local character of the synaptic updates. The framework therefore scales to deep VGG-style convolutional CRNNs and delivers state-of-the-art accuracy on the CIFAR-10 and CIFAR-100 datasets.

What carries the argument

Layer-wise auxiliary supervision signals derived from knowledge distillation and local error signals, which stabilize neuron dynamics convergence during the two-phase equilibrium process at every layer.
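To make the intuition concrete, here is a toy illustration, not the paper's model, of why nudging only the output can leave early layers almost untouched. It assumes a linear chain of units with a quadratic energy, a single coupling weight `w`, and hypothetical per-layer targets. The names `settle`, `displacement`, `kappa`, and `TARGETS` are made up for this sketch. It compares the per-layer shift between free and nudged equilibria with and without a local nudge at every layer.

```python
# Toy linear chain (hypothetical setup, not the paper's architecture).
# States s[0..n-1]; the input x drives the first unit; neighbours are coupled by w.
# Energy: sum_i 0.5*s_i**2 - w*(x*s_0 + sum_i s_i*s_{i+1})

def settle(n, w, x, beta, y, kappa, targets, steps=500, dt=0.5):
    """Relax the chain under an output nudge (beta) and optional per-layer nudges (kappa)."""
    s = [0.0] * n
    for _ in range(steps):
        new = []
        for i in range(n):
            below = x if i == 0 else s[i - 1]
            above = s[i + 1] if i < n - 1 else 0.0
            grad = s[i] - w * (below + above)       # dE/ds_i
            grad += kappa * (s[i] - targets[i])     # local intermediate signal (assumed quadratic)
            if i == n - 1:
                grad += beta * (s[i] - y)           # output nudge
            new.append(s[i] - dt * grad)
        s = new
    return s


def displacement(n, w, x, beta, y, kappa, targets):
    """Per-layer |nudged state - free state|, the quantity EP-style updates are built from."""
    free = settle(n, w, x, 0.0, y, 0.0, targets)
    nudged = settle(n, w, x, beta, y, kappa, targets)
    return [abs(a - b) for a, b in zip(nudged, free)]


N, W, X, Y = 8, 0.3, 1.0, 1.0
TARGETS = [1.0] * N                                 # hypothetical per-layer targets
output_only = displacement(N, W, X, beta=0.5, y=Y, kappa=0.0, targets=TARGETS)
layer_wise = displacement(N, W, X, beta=0.5, y=Y, kappa=0.5, targets=TARGETS)
print(output_only[0], layer_wise[0])
```

In this chain the output-only displacement shrinks roughly geometrically with distance from the output, while the layer-wise variant keeps the first layer's displacement of the same order as the local nudge. Whether the paper's signals behave this way in deep convolutional CRNNs is what its experiments are meant to show.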

If this is right

Deep convolutional CRNNs can be trained end-to-end using only local updates from Equilibrium Propagation.
State-of-the-art accuracy is reached on CIFAR-10 and CIFAR-100 with deep VGG architectures.
The vanishing gradient problem that previously limited EP to shallow models is alleviated.
Locality of updates is retained, keeping the method compatible with neuromorphic hardware constraints.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same layer-wise signals could be tested with other local rules such as those based on predictive coding to see whether they generalize.
Hardware designs for on-chip learning might need fewer global routing resources once intermediate signals are available locally.
Further experiments on sequential or video data would show whether the auxiliary signals remain effective outside static image tasks.
If the signals can be generated with very low extra cost, Equilibrium Propagation could become competitive for continual learning on edge devices.

Load-bearing premise

The extra error signals supplied at each layer will make neuron dynamics converge reliably in deep networks without adding overhead that removes the locality or efficiency advantages of the original Equilibrium Propagation rule.

What would settle it

Applying the method to a deep VGG network on CIFAR-100 and observing that training still fails to converge or reaches markedly lower accuracy than backpropagation would show the signals do not solve the scalability problem.

Figures

Figures reproduced from arXiv: 2508.15989 by Abhronil Sengupta, Jiaqi Lin, Malyaban Bal.

Figure 1: (a) Overview of convolutional CRNNs trained via the EP framework, showing information flow during forward and ...

Figure 2: Simplified architecture of augmented EP framework. (Left) Local error method augments the intermediate representa ...

Figure 3: The gradient of VGG-7 scalar primitive function, trained by the augmented EP framework, converges to steady ...

Figure 4: Performance of VGG-7 network on the CIFAR-10 ...

Figure 5: The neuron activations and weight gradients at ...

Figure 6: Magnitude decay of various learning signal sched ...

Original abstract

Equilibrium Propagation (EP) is a biologically inspired local learning rule first proposed for convergent recurrent neural networks (CRNNs), in which synaptic updates depend only on neuron states from two distinct phases. EP estimates gradients that closely align with those computed by Backpropagation Through Time (BPTT) while significantly reducing computational demands, positioning it as a potential candidate for on-chip training in neuromorphic architectures. However, prior studies on EP have been constrained to shallow architectures, as deeper networks suffer from the vanishing gradient problem, leading to convergence difficulties in both energy minimization and gradient computation. To alleviate the vanishing gradient problem in deep EP networks, we propose a novel EP framework that incorporates layer-wise learning signals to provide auxiliary supervision, which enhances the convergence of neuron dynamics. This is the first work to integrate knowledge distillation and local error signals into EP, enabling the training of significantly deeper architectures. Our proposed approach achieves state-of-the-art performance on the CIFAR-10 and CIFAR-100 datasets, showcasing its scalability on deep VGG architectures. These results represent a significant advancement in the scalability of EP, suggesting that intermediate learning signals can extend the practical applicability of EP to deeper architectures.
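The abstract does not give the form of the distillation term. As a hedged point of reference, the sketch below shows the standard temperature-softened distillation loss in the style of Hinton et al. (the work listed in the references as "Distilling the Knowledge in a Neural Network"). The function names, the temperature value, and the example logits are assumptions for illustration. Whether the paper uses this exact form at intermediate layers is not stated here.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits softened by `temperature`."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on temperature-softened distributions, scaled by T**2."""
    p = softmax(teacher_logits, temperature)   # soft targets from the teacher
    q = softmax(student_logits, temperature)   # student predictions
    kl = sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q))
    return temperature ** 2 * kl


teacher = [2.0, 0.5, -1.0]                     # hypothetical teacher logits
same = distillation_loss(teacher, teacher)
worse = distillation_loss([-1.0, 0.5, 2.0], teacher)
print(same, worse)
```

The loss is zero when student and teacher agree and grows as their softened distributions diverge. The T**2 factor keeps gradient magnitudes comparable across temperatures.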

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper adds layer-wise auxiliary signals to Equilibrium Propagation to reach deeper conv nets on CIFAR, but the signals' locality and the strength of the reported gains still need checking.

The main point here is that the authors combine knowledge-distillation-style intermediate error signals with Equilibrium Propagation to train deeper convolutional CRNNs. This lets them scale past the shallow limits that have held EP back, and they report results on VGG architectures for CIFAR-10 and CIFAR-100 that they describe as state-of-the-art. The integration itself looks like the concrete new piece relative to earlier EP papers, which mostly stayed with shallow nets and standard two-phase updates.

The approach directly targets the vanishing-gradient problem during both the energy minimization and the gradient estimation steps, which is a practical bottleneck for anyone trying to use EP in hardware settings. If the auxiliary signals stay local and cheap, this could genuinely widen the range of networks where EP remains competitive with backprop-style methods. The experiments on deeper models give some evidence that the idea moves the needle on depth scaling, which is the part worth paying attention to.

That said, the central claim rests on the auxiliary signals being computed strictly from adjacent-layer states without extra propagation or global targets. The abstract does not spell out the exact equations, so it is hard to tell whether the fix preserves the original locality guarantee or quietly reintroduces something closer to backprop. If the signals turn out to require non-local information, the scalability benefit comes at the cost of the property that made EP interesting in the first place. The lack of visible ablations, error bars, or baseline comparisons in the summary also makes the size of the improvement difficult to judge.

This work is aimed at researchers who already follow local learning rules and neuromorphic training. A reader looking for concrete extensions of EP will find usable experimental results on deeper nets.

The paper is coherent enough on its own terms to deserve a serious referee, mainly so the locality question and the experimental controls can be examined in detail. I would send it to review rather than desk-reject.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces a novel Equilibrium Propagation (EP) framework for deep convolutional CRNNs that incorporates layer-wise auxiliary learning signals, drawing from knowledge distillation and local error signals, to mitigate vanishing gradients during energy minimization and gradient estimation. This enables training of significantly deeper VGG-style architectures and is claimed to achieve state-of-the-art results on CIFAR-10 and CIFAR-100 while preserving the local, two-phase synaptic update property of EP.

Significance. If the central claims hold, the work would represent a meaningful advance in making EP scalable to practical deep networks, potentially broadening its relevance for neuromorphic hardware and biologically plausible learning. The explicit integration of intermediate supervision to address depth limitations in EP is a targeted contribution, though its impact depends on rigorous verification that locality and efficiency advantages are retained.

major comments (3)

[§3.2, Eq. (8)] The layer-wise error signal is defined using a distillation loss between student and teacher activations, but the manuscript does not derive or demonstrate that this signal can be computed exclusively from adjacent-layer states without requiring a global target or additional forward/backward passes; this leaves open whether the update remains strictly local as required by the original EP formulation in Eq. (3).
[§5.1, Table 2] The reported SOTA accuracy on CIFAR-100 (e.g., 78.4% for VGG-16) lacks ablation controls isolating the contribution of the intermediate signals versus standard EP with increased iterations or different hyperparameters; without these, it is unclear whether the scalability improvement is attributable to the proposed signals or to other implementation choices.
[§4.3] The convergence analysis for the augmented energy function does not quantify the additional computational overhead of computing and propagating the auxiliary signals at each layer, which is necessary to substantiate that the method retains EP's efficiency advantage over BPTT for deep networks.

minor comments (2)

[Figure 3] Caption: The legend for the convergence curves does not specify the exact number of phases or the value of the learning rate schedule used in the deep VGG experiments.
[§2] The related-work discussion on prior EP extensions omits recent work on local learning rules for vision transformers; adding these references would better contextualize the novelty.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their detailed and constructive comments on our manuscript. We address each of the major comments below, providing clarifications and indicating the revisions we will make to strengthen the paper.

Referee: [§3.2, Eq. (8)] The layer-wise error signal is defined using a distillation loss between student and teacher activations, but the manuscript does not derive or demonstrate that this signal can be computed exclusively from adjacent-layer states without requiring a global target or additional forward/backward passes; this leaves open whether the update remains strictly local as required by the original EP formulation in Eq. (3).

Authors: We thank the referee for this insightful comment on the locality of the updates. The layer-wise error signal in Eq. (8) is computed using only the states from adjacent layers during the two phases of EP, without needing the global target or additional passes. The teacher activations are generated locally from the equilibrium states of neighboring layers. We will include a formal derivation demonstrating the strict locality in the revised Section 3.2 to address this concern explicitly.

Revision: yes

Referee: [§5.1, Table 2] The reported SOTA accuracy on CIFAR-100 (e.g., 78.4% for VGG-16) lacks ablation controls isolating the contribution of the intermediate signals versus standard EP with increased iterations or different hyperparameters; without these, it is unclear whether the scalability improvement is attributable to the proposed signals or to other implementation choices.

Authors: The referee correctly identifies a gap in our experimental validation. While Table 2 reports the SOTA results, we acknowledge that additional ablations are needed to isolate the impact of the intermediate signals. We will add comprehensive ablation studies in the revised manuscript, including comparisons with standard EP under equivalent computational budgets and varied hyperparameters, to clearly attribute the scalability improvements to our proposed method.

Revision: yes

Referee: [§4.3] The convergence analysis for the augmented energy function does not quantify the additional computational overhead of computing and propagating the auxiliary signals, which is necessary to substantiate that the method retains EP's efficiency advantage over BPTT for deep networks.

Authors: We agree that quantifying the computational overhead is important for substantiating the efficiency claims. The analysis in §4.3 establishes convergence but does not include overhead metrics. In the revision, we will augment this section with both theoretical bounds and empirical measurements of the additional cost incurred by the auxiliary signals, demonstrating that the overall complexity remains favorable compared to BPTT for deep convolutional CRNNs.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; auxiliary signals introduced as independent extension

The paper proposes integrating layer-wise auxiliary supervision signals and knowledge distillation into Equilibrium Propagation to mitigate vanishing gradients in deep CRNNs, enabling training of VGG-scale architectures on CIFAR-10/100. No load-bearing step reduces by construction to a prior fitted quantity or self-citation chain: the abstract frames the auxiliary signals as a novel addition that enhances neuron dynamics convergence while preserving local two-phase updates, rather than re-deriving EP gradients or performance metrics from the inputs themselves. The derivation chain remains self-contained because the claimed scalability rests on the empirical integration of new signals, not on renaming or fitting existing EP results. This is the expected honest non-finding for a proposal paper that adds mechanisms instead of claiming a first-principles equivalence.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the unstated premise that neuron dynamics in CRNNs will converge under the added signals and that the resulting gradients remain aligned with BPTT; no explicit free parameters or invented entities are detailed in the abstract.

axioms (1)

domain assumption: Equilibrium Propagation gradients closely align with BPTT in convergent recurrent networks
Invoked in the abstract as the basis for EP's utility; this is a standard assumption in the EP literature but not re-derived here.

invented entities (1)

layer-wise learning signals (no independent evidence)
purpose: Provide auxiliary supervision to alleviate vanishing gradients in deep EP
New mechanism introduced to enable depth scaling; no independent evidence such as a predicted observable outside the training loop is mentioned.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · relation: unclear

To alleviate the vanishing gradient problem in deep EP networks, we propose a novel EP framework that incorporates layer-wise learning signals... first integration of knowledge distillation and local error signals into EP

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relation: unclear

The gradient of the objective function L with respect to w can be estimated by computing the divergence of the two stable states

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages · 5 internal anchors

[1]

In 2023 32nd International Joint Conference on Artificial Intelligence (IJCAI)

Sequence Learning using Equilibrium Propagation. In 2023 32nd International Joint Conference on Artificial Intelligence (IJCAI). Bi, G.-q.; and Poo, M.-m

[2]

Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), 4171–4186. Ernoult, M.; Grollier, J.; Querlioz, D.; Bengio, Y.; and Scellier, B

[3]

Ernoult, J

Equilibrium propagation with continual weight updates. arXiv:2005.04168. Frenkel, C.; Lefebvre, M.; and Bol, D

[4]

Gaussian Error Linear Units (GELUs)

Gaussian error linear units (gelus). arXiv:1606.08415. Hinton, G.; Vinyals, O.; and Dean, J

[5]

Distilling the Knowledge in a Neural Network

Distilling the knowledge in a neural network. arXiv:1503.02531. Hochreiter, S

[6]

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv:1502.03167. Khamsi, M. A.; and Kirk, W. A

[7]

arXiv:2302.10886

Some fundamental aspects about lipschitz continuity of neural networks. arXiv:2302.10886. Krizhevsky, A.; Hinton, G.; et al

[8]

In 2024 International Conference on Neuromorphic Systems (ICONS), 312–318

Scaling snns trained using equilibrium propagation to convolutional architectures. In 2024 International Conference on Neuromorphic Systems (ICONS), 312–318. IEEE. Loshchilov, I.; and Hutter, F

[9]

SGDR: Stochastic Gradient Descent with Warm Restarts

Sgdr: Stochastic gradient descent with warm restarts. arXiv:1608.03983. Mao, A.; Mohri, M.; and Zhong, Y

[10]

PyTorch: An Imperative Style, High-Performance Deep Learning Library

Pytorch: An imperative style, high-performance deep learning library. arXiv:1912.01703. Pineda, F

[11]

In Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18, 234–241

U-net: Convolutional networks for biomedical image segmentation. In Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18, 234–241. Springer. Scarselli, F.; Gori, M.; Tsoi, A. C.; Hagenbuchner, M.; and Monfardini, G

[12]

A Vanishing Gradient Problem In this section, Figure 5 demonstrates the neuron states within a VGG-13 trained by both the standard EP and augmented EP frameworks

Collective dynamics of 'small-world' networks. Nature, 393(6684): 440–442. A Vanishing Gradient Problem In this section, Figure 5 demonstrates the neuron states within a VGG-13 trained by both the standard EP and augmented EP frameworks. In the standard EP setting, the neuron activations gradually decay to zero over epochs, resulting in the vanishing ...

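[13]

We define each scheduler as a function of epoch $E$ over the total number of epochs $E_{\mathrm{total}}$.

\begin{aligned}
\kappa_{\mathrm{lin}} &= -\kappa_{\mathrm{init}} \frac{E}{E_{\mathrm{total}}} + \kappa_{\mathrm{init}} \\
\kappa_{\mathrm{exp}} &= \kappa_{\mathrm{init}} \cdot e^{-\gamma E} \\
\kappa_{\mathrm{cos}} &= \kappa_{\mathrm{min}} + 0.5 \cdot (\kappa_{\mathrm{init}} - \kappa_{\mathrm{min}}) \cdot \left(1 + \cos(\pi \cdot E / E_{\mathrm{total}})\right)
\end{aligned} \tag{14}

Here, $\kappa_{\mathrm{lin}}$, $\kappa_{\mathrm{exp}}$, and $\kappa_{\mathrm{cos}}$ denote the learning signal magnitudes $\kappa$ corresponding to the linear scheduler, exponential scheduler, and cosine anne...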
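The three learning-signal schedules in Eq. (14) can be written out directly. The sketch below is a plain transcription of those formulas. The function names and the example values for kappa_init, kappa_min, gamma, and the number of epochs are hypothetical choices for illustration, not settings reported by the paper.

```python
import math


def kappa_linear(epoch, total_epochs, kappa_init):
    """Linear decay from kappa_init at epoch 0 to 0 at total_epochs."""
    return -kappa_init * epoch / total_epochs + kappa_init


def kappa_exponential(epoch, kappa_init, gamma):
    """Exponential decay with rate gamma."""
    return kappa_init * math.exp(-gamma * epoch)


def kappa_cosine(epoch, total_epochs, kappa_init, kappa_min):
    """Cosine annealing from kappa_init down to kappa_min."""
    return kappa_min + 0.5 * (kappa_init - kappa_min) * (
        1 + math.cos(math.pi * epoch / total_epochs)
    )


# Hypothetical settings, chosen only to show the shape of each schedule.
TOTAL, K_INIT, K_MIN, GAMMA = 100, 1.0, 0.1, 0.05
for e in (0, 50, 100):
    print(
        e,
        kappa_linear(e, TOTAL, K_INIT),
        kappa_exponential(e, K_INIT, GAMMA),
        kappa_cosine(e, TOTAL, K_INIT, K_MIN),
    )
```

All three schedules start at kappa_init. The linear one reaches zero at the final epoch, the cosine one reaches kappa_min, and the exponential one decays at a rate set by gamma without depending on the total number of epochs.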
