A projection-based framework for gradient-free and parallel learning

Andreas Bergmeister; Manish Krishan Lal; Stefanie Jegelka; Suvrit Sra

arxiv: 2506.05878 · v3 · pith:HWR2B3TFnew · submitted 2025-06-06 · 💻 cs.LG

Pith reviewed 2026-05-21 23:54 UTC · model grok-4.3

classification 💻 cs.LG

keywords: projection methods, feasibility problems, neural network training, gradient-free optimization, parallel learning, constraint satisfaction, machine learning

The pith

Neural network training can be recast as iteratively projecting parameters onto local constraints from each elementary operation rather than minimizing a loss with gradients.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that training amounts to solving a feasibility problem: locate network parameters and intermediate states that satisfy a set of local constraints, each tied to one basic operation such as a linear layer or activation. Instead of backpropagating gradients, the procedure applies projection operators that map any candidate point onto the nearest point satisfying a given constraint. Because each projection depends only on the inputs and outputs of its own operation, the steps can run independently and therefore in parallel. The approach requires no differentiability, so models may incorporate non-smooth or discrete components without modification.
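To make "projecting onto the nearest point satisfying a constraint" concrete, here is a minimal NumPy sketch for one activation. It is an illustration, not the paper's or PJAX's code: the function name `project_relu_graph` is hypothetical, and the constraint is assumed to be the graph of a ReLU, {(a, b) : b = max(a, 0)}. That set is the union of two rays, so the sketch projects onto each ray and keeps the closer candidate. No derivative of the activation is used anywhere.

```python
import numpy as np

def project_relu_graph(x, y):
    """Nearest point (Euclidean) to (x, y) on the set {(a, b) : b = max(a, 0)}.

    Hypothetical helper for illustration. Works elementwise on arrays.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    # Ray 1: b = a with a >= 0. Project onto the line b = a, then clip to a >= 0.
    t = np.maximum((x + y) / 2.0, 0.0)
    d1 = (x - t) ** 2 + (y - t) ** 2

    # Ray 2: b = 0 with a <= 0. Clip x to be non-positive and set y to 0.
    s = np.minimum(x, 0.0)
    d2 = (x - s) ** 2 + y ** 2

    # Keep whichever candidate is closer to the input point.
    use_ray1 = d1 <= d2
    px = np.where(use_ray1, t, s)
    py = np.where(use_ray1, t, 0.0)
    return px, py

px, py = project_relu_graph(np.array([1.0, -2.0, 2.0, -3.0]),
                            np.array([1.0, 0.0, 0.0, 1.0]))
```

Each output pair satisfies `py == max(px, 0)` by construction, and the computation touches only the input and output of this one operation, which is the locality the paragraph above describes.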

Core claim

Training is equivalent to finding parameters and states that satisfy local constraints derived from the network's elementary operations; this feasibility problem is solved by composing and iterating projection operators that act locally on each operation, yielding a gradient-free procedure that is parallelizable by construction.
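A toy instance of this claim, sketched under simplifying assumptions that are not taken from the paper: a single scalar neuron with one parameter `w` and one state `y`, a fixed input `x0`, and a target output. With the input held fixed, the operation constraint `y = w * x0` and the data constraint `y = target` are both affine sets, so plain alternating projections converge. The helper names are hypothetical.

```python
import numpy as np

def project_hyperplane(z, a, b):
    """Project z onto the hyperplane {v : a . v = b}."""
    return z - (a @ z - b) / (a @ a) * a

def train_single_neuron(x0, target, iters=200):
    """Find (w, y) with y = w * x0 and y = target by alternating projections."""
    z = np.zeros(2)                  # z = (w, y): one parameter, one state
    a_op = np.array([x0, -1.0])      # operation constraint: x0 * w - y = 0
    a_data = np.array([0.0, 1.0])    # data constraint: y = target
    for _ in range(iters):
        z = project_hyperplane(z, a_op, 0.0)
        z = project_hyperplane(z, a_data, target)
    return z

w, y = train_single_neuron(x0=2.0, target=6.0)
```

With `x0 = 2` and `target = 6` the iterates approach `w = 3`, `y = 6`. Real networks involve non-convex constraint sets, where this convergence guarantee no longer holds.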

What carries the argument

Projection operators onto the local constraints derived from each elementary network operation, composed automatically to solve the overall feasibility problem.

If this is right

Projections act locally, so computation can be distributed across layers or operations without sequential dependencies.
Non-differentiable or discrete operations can be used directly because no derivative is required.
The same projection machinery supports training of MLPs, CNNs, and RNNs on established benchmarks.
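The parallelism point can be illustrated with a simultaneous-projection scheme in the style of Cimmino's method: every constraint is projected onto from the same current point, and the results are averaged. This is a generic sketch, not the algorithm the paper uses; it assumes linear equality constraints `A[i] . z = b[i]`, and the function name is hypothetical.

```python
import numpy as np

def averaged_projections(A, b, iters=300):
    """Cimmino-style iteration for the constraints A[i] . z = b[i]."""
    m, n = A.shape
    z = np.zeros(n)
    row_norms_sq = np.sum(A * A, axis=1)
    for _ in range(iters):
        # Row i of `projections` is the projection of the same z onto
        # constraint i. The rows do not depend on each other, so they
        # could be computed on separate devices.
        residuals = A @ z - b
        projections = z - (residuals / row_norms_sq)[:, None] * A
        z = projections.mean(axis=0)
    return z

A = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 1.0]])
b = A @ np.array([1.0, 2.0, 3.0])
z = averaged_projections(A, b)
```

The only synchronization point per iteration is the averaging step; the individual projections have no sequential dependency on one another.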

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The local nature of projections suggests the method could scale to networks too large for standard gradient communication patterns.
Because feasibility replaces loss minimization, the approach may naturally incorporate hard constraints on weights or activations that are difficult to enforce with penalties.
A direct test would compare iteration counts and wall-clock time on multi-device hardware against backpropagation for networks of increasing depth.

Load-bearing premise

Feasible solutions to the local constraints exist and correspond to models that achieve useful performance, and the individual projection steps remain efficient when composed across large networks.

What would settle it

A controlled experiment in which the projection method is applied to a standard MLP on MNIST yet produces test accuracy no better than random guessing after a fixed number of iterations, while a gradient-based baseline reaches high accuracy under identical conditions.

Original abstract

We present a feasibility-seeking approach to neural network training. This mathematical optimization framework is distinct from conventional gradient-based loss minimization and uses projection operators and iterative projection algorithms. We reformulate training as a large-scale feasibility problem: finding network parameters and states that satisfy local constraints derived from its elementary operations. Training then involves projecting onto these constraints, a local operation that can be parallelized across the network. We introduce PJAX, a JAX-based software framework that enables this paradigm. PJAX composes projection operators for elementary operations, automatically deriving the solution operators for the feasibility problems (akin to autodiff for derivatives). It inherently supports GPU/TPU acceleration, provides a familiar NumPy-like API, and is extensible. We train diverse architectures (MLPs, CNNs, RNNs) on standard benchmarks using PJAX, demonstrating its functionality and generality. Our results show that this approach is a compelling alternative to gradient-based training, with clear advantages in parallelism and the ability to handle non-differentiable operations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper recasts neural network training as a feasibility problem solved by local projections and ships PJAX to implement it, but convergence on non-convex sets is unaddressed.

The main point is that they've recast neural network training as finding parameters that meet a set of local feasibility constraints from each operation, solved by composing and iterating projection operators, and they built PJAX in JAX to do this automatically. What's new here is the feasibility framing and the way they derive projection operators for things like affine layers and activations. It's different from gradient descent and allows inherent parallelism because each projection is local. They demonstrate it by training MLPs, CNNs, and RNNs on benchmarks. It does a decent job showing the concept works at least for these cases and highlighting the benefits for non-differentiable ops and distributed setups.

The soft spot is the convergence question. Since the constraint sets are non-convex for most activations, there's no guarantee that the iterations will find a feasible point that corresponds to a trained model, and the paper doesn't seem to provide analysis or failure case experiments to address this. The abstract mentions success but without numbers or comparisons it's hard to judge how competitive it is.

This is for folks interested in new ways to train networks outside the usual backprop paradigm, especially if they care about parallelism or avoiding gradients. A reader experimenting with alternative optimizers might find the PJAX code useful to play with. I'd send it to peer review. The idea is distinct enough and the implementation concrete enough that referees can evaluate the practical side and push for better evidence on reliability.

Referee Report

2 major / 1 minor

Summary. The paper proposes reformulating neural network training as a large-scale feasibility problem: finding parameters and intermediate states that satisfy local constraints derived from elementary operations (affine transforms, activations, etc.). These constraints are solved via iterative composition of projection operators, implemented in the PJAX JAX-based library that automatically derives the projection solution operators. The authors demonstrate the method on MLPs, CNNs, and RNNs trained on standard benchmarks, claiming advantages in parallelism and support for non-differentiable operations.

Significance. If the iterative projections reliably converge to feasible points yielding competitive models, the framework would provide a genuinely gradient-free training paradigm with built-in parallelism across layers or operations and native handling of non-differentiable components. The open-source PJAX implementation with NumPy-like API, GPU/TPU support, and extensibility constitutes a concrete, reproducible contribution that could enable new lines of research on projection-based optimization for machine learning.

major comments (2)

[Experiments] The central claim that the method constitutes a viable training procedure rests on the assumption that iterated projections onto the (typically non-convex) local constraint sets reach feasible points corresponding to useful models. Standard POCS or alternating-projection convergence results do not apply to non-convex sets, yet the manuscript provides neither a convergence analysis nor residual-norm plots or failure-mode experiments in the Experiments section.
[Experiments] The abstract and reported results assert that the approach is a 'compelling alternative' to gradient-based training, but no quantitative performance numbers (accuracy, loss, wall-clock time), direct comparisons against SGD/Adam, or scaling behavior with network depth/width are supplied, leaving the practical advantages unsubstantiated.
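The residual-norm diagnostic requested in the first major comment can be sketched on a toy non-convex problem. This example is not from the paper: it assumes two sets, the unit circle (non-convex) and the line y = 0.5, which do intersect, and records the gap between successive projections. All function names are hypothetical.

```python
import numpy as np

def project_circle(z):
    """Nearest point on the unit circle (a non-convex set)."""
    norm = np.linalg.norm(z)
    if norm == 0.0:
        return np.array([1.0, 0.0])  # every circle point is equally near; pick one
    return z / norm

def project_line(z, height=0.5):
    """Nearest point on the horizontal line y = height."""
    return np.array([z[0], height])

def residual_history(z0, iters=50):
    """Alternate the two projections and record the gap between them."""
    z = np.asarray(z0, dtype=float)
    gaps = []
    for _ in range(iters):
        on_circle = project_circle(z)
        on_line = project_line(on_circle)
        gaps.append(np.linalg.norm(on_circle - on_line))
        z = on_line
    return np.array(gaps)

good = residual_history([0.1, 2.0])   # gap shrinks toward zero
stuck = residual_history([0.0, 2.0])  # gap stays at 0.5 forever
```

From the symmetric start the iterates cycle between (0, 1) and (0, 0.5) and never reach the intersection, even though one exists. A plot of such gaps per iteration is the kind of evidence that would separate genuine convergence from stalling.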

minor comments (1)

[Method] Notation for the composed projection operator and the distinction between parameter and state variables could be clarified with a small running example early in the method section.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed review and constructive comments on our manuscript. We address each of the major comments below and describe the revisions planned for the next version of the paper.

Referee: [Experiments] The central claim that the method constitutes a viable training procedure rests on the assumption that iterated projections onto the (typically non-convex) local constraint sets reach feasible points corresponding to useful models. Standard POCS or alternating-projection convergence results do not apply to non-convex sets, yet the manuscript provides neither a convergence analysis nor residual-norm plots or failure-mode experiments in the Experiments section.

Authors: We recognize that standard convergence theorems for projection methods do not extend directly to the non-convex constraint sets encountered in neural network training. The manuscript prioritizes the introduction of the feasibility framework and the PJAX implementation over theoretical analysis. To provide empirical evidence of convergence, we will add residual-norm plots and failure-mode analysis to the Experiments section in the revised manuscript.
revision: partial

Referee: [Experiments] The abstract and reported results assert that the approach is a 'compelling alternative' to gradient-based training, but no quantitative performance numbers (accuracy, loss, wall-clock time), direct comparisons against SGD/Adam, or scaling behavior with network depth/width are supplied, leaving the practical advantages unsubstantiated.

Authors: The demonstrations in the manuscript focus on the applicability of the method to various architectures rather than exhaustive performance benchmarking. We agree that including quantitative comparisons would better support the claims. In the revision, we will incorporate accuracy and loss metrics, comparisons to SGD and Adam, as well as scaling experiments with respect to network depth and width.
revision: yes

Circularity Check

0 steps flagged

No circularity: feasibility reformulation uses established projections on independently derived local constraints

The paper's core chain reformulates training as finding parameters satisfying local constraints from elementary operations (affine transforms, activations), then applies iterative projections via PJAX. These constraints are defined directly from the network's forward operations without reference to the final trained model or loss values, and the projection steps rely on standard algorithms rather than any fitted parameter or self-citation that encodes the target result. Empirical training of MLPs/CNNs/RNNs on benchmarks provides independent verification outside the derivation itself.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework depends on the ability to define and compose projection operators for standard neural network primitives; no free parameters, new entities, or additional axioms are introduced beyond standard optimization assumptions.

axioms (1)

domain assumption: Local constraints derived from elementary operations admit feasible solutions that yield effective models.
The feasibility formulation assumes that solutions to the constraint system correspond to useful trained networks.

pith-pipeline@v0.9.0 · 5709 in / 1031 out tokens · 36626 ms · 2026-05-21T23:54:43.779713+00:00 · methodology
