On the Infinite Width and Depth Limits of Predictive Coding Networks

El Mehdi Achour; Francesco Innocenti; Rafal Bogacz

arxiv: 2602.07697 · v2 · pith:DFFTQOG7new · submitted 2026-02-07 · 💻 cs.LG · cs.AI · cs.NE

Pith reviewed 2026-05-25 06:54 UTC · model grok-4.3

keywords: predictive coding · backpropagation · infinite width limits · residual networks · local learning rules · energy minimization · neural network training

The pith

Predictive coding networks match backpropagation gradients exactly in wide linear residual networks under the same stable parameterizations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies the infinite width and depth limits of predictive coding networks to determine which parameterizations allow stable deep training. For linear residual networks it proves that the feature-learning parameterizations stable in both width and depth are identical to those already known for backpropagation. Under any such parameterization the equilibrated predictive-coding energy converges to the quadratic backpropagation loss once width greatly exceeds depth, so the two methods produce identical gradients. Experiments confirm the same convergence for nonlinear models including convolutional networks and transformers, provided activities reach equilibrium before each weight update. The result therefore restricts the scalable choices for predictive coding to those already validated for backpropagation while showing that backpropagation itself can be realized through strictly local updates in sufficiently wide architectures.

Core claim

For linear residual networks, the set of width- and depth-stable feature-learning parameterisations for predictive coding is exactly the same as for backpropagation. Under any of these parameterisations, the predictive-coding energy with equilibrated activities converges to the quadratic backpropagation loss when the model width is much larger than the depth, resulting in predictive coding computing the same gradients as backpropagation. Experiments show that, as long as an activity equilibrium is reached, convergence to backpropagation holds for nonlinear models including convolutional networks and transformers.

What carries the argument

The convergence of the predictive-coding energy function (with activities at equilibrium) to the quadratic backpropagation loss in the infinite-width limit of linear residual networks.
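
A sketch of what this convergence statement can look like, assuming the standard squared-error PC energy for a linear network with layer maps $M_\ell$ (for a residual block, $M_\ell = I + a W_\ell$ with some scaling $a$). The symbols $M_\ell$, $z_\ell$, $r$ and $S$ are notation chosen here for illustration and may not match the paper's.

```latex
% Energy with input and output clamped: z_0 = x, z_L = y
\mathcal{F}(z_1,\dots,z_{L-1}) = \sum_{\ell=1}^{L} \tfrac{1}{2}\,\lVert z_\ell - M_\ell z_{\ell-1} \rVert^2

% Minimising over the hidden activities (a Gaussian marginalisation) gives
\mathcal{F}^{*} = \tfrac{1}{2}\, r^{\top} S^{-1} r,
\qquad r = y - M_{L:1}\,x,
\qquad S = I + \sum_{\ell=2}^{L} M_{L:\ell} M_{L:\ell}^{\top},
\qquad M_{L:\ell} = M_L M_{L-1} \cdots M_\ell
```

In this form the equilibrated energy equals the quadratic BP loss $\tfrac{1}{2}\lVert r \rVert^2$ exactly when $S = I$, so the claim amounts to $S$ approaching the identity. Which scalings achieve that, and at what rate in width and depth, is what the paper's limit analysis is reported to settle.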

If this is right

Only the parameterizations already known to permit stable deep backpropagation also permit stable deep predictive coding.
In networks whose width greatly exceeds their depth, predictive coding with equilibrated activities computes the same weight gradients as backpropagation while using only local updates.
The same reparameterizations that stabilize backpropagation training also stabilize predictive coding training.
Predictive coding supplies a local-update mechanism that can realize backpropagation-style learning in architectures that are wide relative to their depth.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The result suggests that any sufficiently wide cortical circuit using predictive coding could implement backpropagation-like credit assignment without explicit error back-propagation.
Whether real neural circuits maintain the required activity equilibrium during learning becomes a testable prediction of the theory.
The same convergence analysis could be applied to other local energy-minimization rules to check whether they recover backpropagation in the wide limit.

Load-bearing premise

That network activities reach equilibrium before each weight update during the predictive coding dynamics.

What would settle it

Direct numerical comparison of the weight gradients produced by predictive coding versus backpropagation on the same wide linear residual network after activity equilibration; mismatch would falsify the claimed convergence.
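
A minimal sketch of such a comparison, under stated assumptions: the layer scalings in `make_layers` (a `1/sqrt(width*depth)` factor on residual branches and a `1/width` readout) are illustrative choices, not the paper's parameterisation, and the activity equilibrium is obtained in closed form for the linear case rather than by running PC inference iterations. All function names are hypothetical.

```python
import numpy as np

def make_layers(width, depth, d_in=4, d_out=1, seed=0):
    """Layer maps M_1..M_L of a linear residual net (scalings are assumptions)."""
    rng = np.random.default_rng(seed)
    layers = [rng.standard_normal((width, d_in)) / np.sqrt(d_in)]   # input layer
    for _ in range(depth - 2):                                       # residual blocks
        W = rng.standard_normal((width, width))
        layers.append(np.eye(width) + W / np.sqrt(width * depth))
    layers.append(rng.standard_normal((d_out, width)) / width)       # 1/width readout
    return layers

def bp_grads(layers, x, y):
    """Gradients of 0.5*||y - M_L ... M_1 x||^2 with respect to each M_l."""
    acts = [x]
    for M in layers:
        acts.append(M @ acts[-1])
    delta = acts[-1] - y                       # d(loss)/d(output)
    grads = []
    for l in reversed(range(len(layers))):
        grads.append(np.outer(delta, acts[l]))
        delta = layers[l].T @ delta            # backpropagate to the previous layer
    return grads[::-1]

def pc_grads(layers, x, y):
    """Gradients of sum_l 0.5*||z_l - M_l z_{l-1}||^2 at the exact activity equilibrium."""
    L = len(layers)
    # tails[l] maps the prediction error of layer l (0-indexed) to the output
    tails = [None] * L
    tails[L - 1] = np.eye(layers[-1].shape[0])
    for l in range(L - 2, -1, -1):
        tails[l] = tails[l + 1] @ layers[l + 1]
    pred = x
    for M in layers:
        pred = M @ pred
    S = sum(T @ T.T for T in tails)            # rescaling matrix of the equilibrated energy
    u = np.linalg.solve(S, y - pred)
    errs = [T.T @ u for T in tails]            # equilibrium prediction errors
    z, grads = x, []
    for M, e in zip(layers, errs):
        grads.append(-np.outer(e, z))          # dF/dM_l = -error_l * input_l^T
        z = M @ z + e                          # equilibrium activity of the next layer
    return grads, errs, z                      # final z should equal the clamped target y

def max_rel_error(width, depth=8, seed=0):
    """Largest per-layer relative gap between PC and BP gradients."""
    layers = make_layers(width, depth, seed=seed)
    x, y = np.ones(4), np.ones(1)
    g_bp = bp_grads(layers, x, y)
    g_pc, _, _ = pc_grads(layers, x, y)
    return max(np.linalg.norm(p - b) / np.linalg.norm(b) for p, b in zip(g_pc, g_bp))

if __name__ == "__main__":
    for width in (32, 128, 512):
        print(width, max_rel_error(width))
```

Under these assumptions the printed gap should shrink as width grows at fixed depth. A gap that stayed flat or grew with width would be the kind of mismatch described above.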

Original abstract

Predictive coding (PC) is a biologically plausible alternative to standard backpropagation (BP) that minimises an energy function with respect to network activities before updating weights. Recent work has improved the training stability of deep PC networks (PCNs) by leveraging some BP-inspired reparameterisations. However, the full scalability and theoretical basis of these methods remain unclear. To address this gap, we study the infinite width and depth limits of PCNs. For linear residual networks, we show that the set of width- and depth-stable feature-learning parameterisations for PC is exactly the same as for BP. Moreover, under any of these parameterisations, the PC energy with equilibrated activities converges to the quadratic BP loss when the model width is much larger than the depth, resulting in PC computing the same gradients as BP. Experiments show that, as long as an activity equilibrium is reached, convergence to BP holds for nonlinear models including convolutional networks and transformers. Overall, this work constrains the types of parameterisation that are scalable with PC, while showing a way in which BP can be effectively implemented with only local updates in much wider than deep networks like the brain.
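
The two phases described in the abstract can be written out as a sketch, assuming a linear layer map $M_\ell$ and the standard squared-error energy. The notation ($\epsilon_\ell$, $\eta$, $M_\ell$) is chosen here for illustration and is not taken from the paper.

```latex
% Prediction error at layer \ell, and the energy it contributes to
\epsilon_\ell = z_\ell - M_\ell z_{\ell-1},
\qquad \mathcal{F} = \sum_{\ell=1}^{L} \tfrac{1}{2}\,\lVert \epsilon_\ell \rVert^2

% Phase 1 (inference): relax hidden activities with weights fixed
\dot{z}_\ell = -\frac{\partial \mathcal{F}}{\partial z_\ell}
             = -\epsilon_\ell + M_{\ell+1}^{\top}\epsilon_{\ell+1},
\qquad \ell = 1,\dots,L-1

% Phase 2 (learning): update weights at the activity equilibrium
\Delta M_\ell = -\eta\,\frac{\partial \mathcal{F}}{\partial M_\ell}
              = \eta\,\epsilon_\ell\, z_{\ell-1}^{\top}
```

Both updates use only quantities from adjacent layers, which is the sense in which the rule is local.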

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

PC and BP share identical stable parameterizations in the infinite-width and infinite-depth limits for linear residual networks, with energy convergence when width >> depth, but only conditional on activity equilibration.

The core finding is that for linear residual networks the width- and depth-stable feature-learning parameterizations are exactly the same for predictive coding as for backprop. Under any of those, the equilibrated PC energy converges to the quadratic BP loss when width greatly exceeds depth, so the gradients match. This exact match plus the convergence statement is new relative to the BP scaling literature cited in the abstract. The paper also reports that the same pattern appears experimentally for nonlinear models including conv nets and transformers, again provided equilibrium is reached. The linear-case analysis looks clean on the points that are stated, and the constraint on viable parameterizations is a useful takeaway for anyone trying to scale PC.

The central soft spot is the equilibration requirement. The claims are explicitly conditional on activities reaching a fixed point before the weight step, yet the abstract and stress-test note give no derivation showing that the PC dynamics (gradient descent on the energy) reach equilibrium in finite iterations whose number does not grow with depth. If that fails, the loss convergence and gradient equivalence do not hold. Experiments are reported only for runs that do equilibrate, so they do not test the boundary.

The work is aimed at people studying local learning rules and infinite-width limits. Readers who care about whether PC can implement BP-like updates at scale will get value from the parameterization constraints. It is worth sending to a serious referee to check the limit derivations and the equilibration dynamics in detail.

Referee Report

2 major / 1 minor

Summary. The paper studies the infinite width and depth limits of predictive coding networks (PCNs). For linear residual networks, the set of width- and depth-stable feature-learning parameterizations for PC is shown to be identical to that for backpropagation (BP). Under these parameterizations, the PC energy with equilibrated activities converges to the quadratic BP loss when width ≫ depth, implying that PC computes the same gradients as BP. Experiments indicate that this convergence holds for nonlinear models (including CNNs and transformers) provided an activity equilibrium is reached during the PC dynamics.

Significance. If the central claims hold, the work supplies a precise characterization of the parameterizations that make PC scalable with depth and width, while establishing a regime in which local PC updates implement BP exactly. The explicit identification of the shared stable parameterizations and the width ≫ depth equivalence constitute a substantive theoretical contribution; the extension to nonlinear architectures via experiments further strengthens the result. These findings constrain the design space for biologically plausible learning rules and clarify the conditions under which PC can serve as a local implementation of BP.

major comments (2)

[Abstract] The equivalence between PC and BP gradients is conditioned on the activities reaching equilibrium before each weight update. No derivation or bound is supplied showing that the number of PC iterations needed for equilibration remains finite and independent of depth under the identified parameterizations, nor that the infinite-width limit commutes with the equilibration step; this assumption is load-bearing for both the linear convergence claim and the nonlinear experimental conclusions.
[Linear residual analysis, presumably §3–4] The demonstration that the stable feature-learning parameterizations coincide with those of BP is derived under the equilibrated-energy assumption. If the PC dynamics fail to equilibrate in finite steps when depth is large (even if width is larger), the claimed reduction of the PC energy to the quadratic BP loss does not hold for any finite network, weakening the practical implication that PC implements BP.

minor comments (1)

The experimental section would benefit from explicit reporting of the number of PC iterations used per weight update and a diagnostic confirming that the activity equilibrium condition was met (or quantifying the residual energy) across the tested architectures.
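
A diagnostic of the kind requested could look like the following sketch. It assumes a linear network with near-identity layer maps and plain gradient descent on the activities; the step size, tolerance, layer construction and the name `equilibrate` are all illustrative assumptions, not the authors' setup.

```python
import numpy as np

def equilibrate(layers, x, y, step=0.2, tol=1e-8, max_iters=5000):
    """Gradient descent on hidden activities of F = sum_l 0.5*||z_{l+1} - M_l z_l||^2.

    Returns the activities and a trace of the energy-gradient norm, which
    serves as the residual diagnostic: it is zero at an exact equilibrium.
    """
    z = [x]
    for M in layers[:-1]:
        z.append(M @ z[-1])          # feedforward initialisation of hidden activities
    z.append(y)                       # output layer clamped to the target
    trace = []
    for _ in range(max_iters):
        errs = [z[l + 1] - layers[l] @ z[l] for l in range(len(layers))]
        # dF/dz_l for hidden layers l = 1..L-1
        grads = [errs[l - 1] - layers[l].T @ errs[l] for l in range(1, len(layers))]
        norm = np.sqrt(sum(float(g @ g) for g in grads))
        trace.append(norm)
        if norm < tol:
            break
        for l, g in enumerate(grads, start=1):
            z[l] = z[l] - step * g
    return z, trace

# Hypothetical toy setup: six near-identity layers of width 16.
rng = np.random.default_rng(0)
n, depth = 16, 6
layers = [np.eye(n) + 0.1 * rng.standard_normal((n, n)) / np.sqrt(n) for _ in range(depth)]
x, y = rng.standard_normal(n), rng.standard_normal(n)
z, trace = equilibrate(layers, x, y)
print(f"inference iterations: {len(trace)}, residual gradient norm: {trace[-1]:.2e}")
```

Reporting `len(trace)` and `trace[-1]` per weight update, across depths and widths, would show directly whether the iteration count needed for equilibrium grows with depth.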

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their positive evaluation of the work's significance and for identifying the central role of the activity-equilibration assumption. We address each major comment below, agreeing where the manuscript is limited and indicating planned revisions.

Referee: [Abstract] The equivalence between PC and BP gradients is conditioned on the activities reaching equilibrium before each weight update. No derivation or bound is supplied showing that the number of PC iterations needed for equilibration remains finite and independent of depth under the identified parameterizations, nor that the infinite-width limit commutes with the equilibration step; this assumption is load-bearing for both the linear convergence claim and the nonlinear experimental conclusions.

Authors: We agree that all equivalence claims are explicitly conditioned on activity equilibrium, as stated in the abstract ('as long as an activity equilibrium is reached') and in the experimental conclusions. The theoretical derivations for linear residual networks are performed under the equilibrated-energy assumption, which is standard in the PC literature. The manuscript supplies neither a bound on iteration count independent of depth nor a proof that the infinite-width limit commutes with equilibration. In the revision we will add a dedicated paragraph in the discussion section acknowledging this limitation, while noting that the identification of the shared width- and depth-stable parameterizations is independent of the equilibration dynamics and that the width ≫ depth regime empirically supports rapid equilibration in the reported experiments.
revision: partial

Referee: [Linear residual analysis, presumably §3–4] The demonstration that the stable feature-learning parameterizations coincide with those of BP is derived under the equilibrated-energy assumption. If the PC dynamics fail to equilibrate in finite steps when depth is large (even if width is larger), the claimed reduction of the PC energy to the quadratic BP loss does not hold for any finite network, weakening the practical implication that PC implements BP.

Authors: Sections 3 and 4 derive the coincidence of stable parameterizations and the reduction to the quadratic BP loss under the assumption of equilibrated activities. For any finite network that has not reached equilibrium the exact reduction does not hold, and the manuscript does not claim otherwise. The core theoretical contribution is the characterization of the parameterizations that remain stable in the infinite-width and infinite-depth limits; the width ≫ depth equivalence to BP is presented as a limiting statement under equilibration. We will revise the text in §4 and the conclusion to state more explicitly that practical use requires a sufficient number of PC iterations to approximate equilibrium and that the paper provides no guarantee of finite-step equilibration independent of depth.
revision: partial

Circularity Check

0 steps flagged

No circularity: derivation self-contained via limit analysis

The paper derives equivalence between PC and BP parameterizations and loss convergence for linear residual networks through explicit infinite-width/depth limit arguments on the energy function and equilibrated activities. These steps rely on standard scaling analysis and fixed-point convergence rather than any self-definition, fitted inputs renamed as predictions, or load-bearing self-citations that reduce the central claim to prior unverified assertions by the same authors. The equilibration assumption is stated explicitly as a precondition but does not create a definitional loop within the presented math. The overall chain remains independent of its conclusions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract provides limited detail on assumptions; primary domain assumption is equilibrium of activities, with standard mathematical limits for width and depth.

axioms (1)

domain assumption: Activity equilibrium is reached in the PC dynamics.
Required for the convergence statements and explicitly conditioned on in the experimental claims.
