NRGPT: An Energy-based Alternative for GPT

Benjamin Hoover; Bishwajit Saha; Dmitry Krotov; Jean-Jacques Slotine; Leo Kozachkov; Nima Dehmamy

arxiv: 2512.16762 · v3 · submitted 2025-12-18 · 💻 cs.LG

Nima Dehmamy, Benjamin Hoover, Bishwajit Saha, Leo Kozachkov, Jean-Jacques Slotine, Dmitry Krotov

Pith reviewed 2026-05-16 21:31 UTC · model grok-4.3

classification 💻 cs.LG

keywords energy-based models · GPT · transformers · language modeling · gradient descent · inference dynamics · energy landscape

The pith

A minimal change to GPT turns inference into exploration of a token energy landscape that can reduce to gradient descent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how a small architectural tweak lets standard GPT models be viewed as energy-based systems. Inference is recast as a dynamical process that explores possible next tokens according to an energy function rather than making direct predictions. The authors prove that this exploration equals gradient descent on the energy surface under defined conditions, although the resulting models are not always the strongest performers. Experiments demonstrate solid results on Shakespeare text, algebraic ListOPS problems, and full OpenWebText language modeling, with the added benefit that overfitting appears only after unusually long training runs. The work therefore supplies a direct bridge between the dominant transformer design and the older energy-based modeling paradigm.

Core claim

The central claim is that GPT can be minimally modified so that its forward pass defines a consistent energy landscape over token sequences, turning inference into an explicit exploration process on that landscape. Under specific choices of the modification, the exploration dynamics are mathematically identical to gradient descent on the energy function. The paper verifies both the mathematical reduction and the practical language-modeling performance of the resulting NRGPT models across multiple datasets.

What carries the argument

The energy function defined on the modified GPT hidden states or token logits, whose value governs the iterative exploration steps that replace standard autoregressive prediction.

If this is right

NRGPT achieves usable performance on Shakespeare, ListOPS, and OpenWebText without requiring mechanisms beyond the single modification.
Inference dynamics become exactly gradient descent on the energy surface for particular parameter settings.
Overfitting is delayed until very long training, providing a built-in regularization effect.
The same architecture supports both standard next-token prediction and explicit energy minimization interpretations.
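
The gradient-descent reading in the list above can be illustrated on a toy problem. This is a minimal sketch, not the paper's model: the quadratic energy, the 2-dimensional "token state", and the names `energy`, `grad_energy`, and `descend` are all assumptions introduced here. It applies the update x ← x − η∇E(x) with a small scalar rate and records the energy after every step.

```python
# Toy illustration only: a hand-picked quadratic energy stands in for the
# learned NRGPT energy, and a 2-vector stands in for a token state.

def energy(x):
    """E(x) = 0.5 * (3*x0^2 + x1^2), a convex bowl with its minimum at 0."""
    return 0.5 * (3.0 * x[0] ** 2 + x[1] ** 2)

def grad_energy(x):
    """Analytic gradient of the toy energy."""
    return [3.0 * x[0], x[1]]

def descend(x, eta, steps):
    """Apply x <- x - eta * grad E(x); return final state and energy trace."""
    trace = [energy(x)]
    for _ in range(steps):
        g = grad_energy(x)
        x = [xi - eta * gi for xi, gi in zip(x, g)]
        trace.append(energy(x))
    return x, trace

x_final, trace = descend([1.0, -2.0], eta=0.1, steps=50)
# True when the energy never increases from one step to the next.
monotone = all(b <= a for a, b in zip(trace, trace[1:]))
print(f"start {trace[0]:.4f}, final {trace[-1]:.6f}, monotone={monotone}")
```

With a small enough scalar rate the energy trace is non-increasing on this toy bowl. The paper's models use a learned energy and a learnable rate, so this only shows the shape of the update, not its behavior on language data.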

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The energy view could let researchers import stability guarantees or convergence rates from dynamical-systems theory directly into transformer training schedules.
Attention patterns might acquire a clearer statistical-mechanics reading as interactions that shape the overall energy surface.
Hybrid models that alternate between energy-based exploration and conventional decoding steps could be tested for improved sample efficiency on long contexts.

Load-bearing premise

A minimal modification to the GPT architecture is enough to produce a well-defined energy landscape whose exploration yields coherent language modeling without extra training tricks or architectural components.

What would settle it

Numerical simulation of the NRGPT inference dynamics on a small vocabulary where the claimed reduction to gradient descent fails to match observed token trajectories, or training runs on Shakespeare that produce incoherent output after standard training budgets.

Figures

Figures reproduced from arXiv: 2512.16762 by Benjamin Hoover, Bishwajit Saha, Dmitry Krotov, Jean-Jacques Slotine, Leo Kozachkov, Nima Dehmamy.

**Figure 1.** NRGPT casts the standard GPT setting into an energy-based framework. The network is defined as the sum of two energies: an attention energy and a feedforward energy. Each token is transformed into the next token by exploring the energy landscape. Recurrent application of the NRGPT block produces a dynamical system where each token can be thought of as a particle moving on the network’s energy landscape. Si…

**Figure 3.** Learning ListOps: NRGPT variants match performance with a recurrent GPT model on ListOps accuracy parameter-transition points (top) and training/validation losses (bottom). The accuracy of models is tested on nested, mixed arithmetic tasks of maximum, median and sum modulo 20. For all plots, the x axis shows the total parameter count of the model. The yellow star indicates the transition to learning, which…

**Figure 2.** In NRGPT, tokens converge to stable states of low energy where the causal attention mask allows each token energy to fluctuate during inference. Shown are 64 tokens passed to an NRGPT model trained to predict ListOps equations. NRGPT without constraints on the inference rate η is not forced to strictly decrease energy during inference and it may learn other exploration strategies for inference. Neverthel…

**Figure 4.** Shakespeare scaling: NRGPT achieves performance parity with recurrent GPT on Shakespeare across parameter sizes, as measured by best validation loss per number of parameters. For many embedding sizes, NRGPT also follows the same optimal training loss trajectory-per-parameter as both GPT and recurrent GPT baselines. However, NRGPT does not overfit Shakespeare at large parameter sizes. Connecting lines show…

**Figure 5.** Best Generation Examples from GPT (left column), RGPT-parallel (middle column) and…

**Figure 6.** NRGPT’s inference rate matrix η does not need to be constrained to induce convergent dynamics. Shown is a 100-layer NRGPT model that achieves 100% accuracy on ListOps. Shown are 64 tokens from a validation set passed to NRGPT, where token dynamics are shown to stabilize and converge without any constraint on η.

Original abstract

Generative Pre-trained Transformer (GPT) architectures are the most popular design for language modeling. Energy-based modeling is a different paradigm that views inference as a dynamical process operating on an energy landscape. We propose a minimal modification of the GPT setting to unify it with the EBM framework. The inference step of our model, which we call eNeRgy-GPT (NRGPT), is conceptualized as an exploration of the tokens on the energy landscape. We prove, and verify empirically, that under certain circumstances this exploration becomes gradient descent, although they don't necessarily lead to the best performing models. We demonstrate that our model performs well for simple language (Shakespeare dataset), algebraic ListOPS tasks, and richer settings such as OpenWebText language modeling. We also observe that our models may be more resistant to overfitting, doing so only during very long training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

NRGPT gives a minimal way to cast GPT inference as energy minimization with a claimed gradient-descent equivalence in narrow cases, but the practical payoff stays modest and the differentiability step over discrete tokens looks shaky.

The paper's main contribution is a small architectural tweak that lets standard GPT next-token steps be read as movement on an explicit energy landscape. The authors prove that under specific conditions this movement reduces to gradient descent, and they show that the resulting models work on Shakespeare, ListOPS, and OpenWebText while overfitting more slowly than standard GPT models during long runs. That resistance to overfitting is the clearest empirical signal they report.

Referee Report

2 major / 1 minor

Summary. The paper proposes NRGPT (also called eNeRgy-GPT), a minimal modification to standard GPT architectures that unifies them with energy-based models by framing token inference as exploration on an energy landscape. It claims to prove that under certain circumstances this exploration reduces to gradient descent (though the resulting models are not necessarily optimal), and reports empirical results on the Shakespeare dataset, algebraic ListOPS tasks, and OpenWebText language modeling, with additional observations of greater resistance to overfitting during extended training.

Significance. If the claimed equivalence between token exploration and gradient descent holds under well-specified conditions and the energy function is rigorously defined, the work could provide a useful conceptual bridge between transformer architectures and energy-based modeling, potentially informing new inference dynamics or optimization views in language modeling. The empirical demonstrations across simple and richer tasks, including the overfitting resistance, add practical interest. The authors' caveat that the models do not necessarily yield the best performance appropriately limits the scope of the contribution.

major comments (2)

Abstract: the central claim that 'we prove... this exploration becomes gradient descent' is load-bearing but supplies no details on the energy function definition, the precise 'certain circumstances,' or how differentiability with respect to discrete token sequences is achieved (e.g., via continuous relaxation or otherwise). This prevents assessment of whether the reduction is exact or merely heuristic.
Empirical sections: the reported performance on Shakespeare, ListOPS, and OpenWebText lacks explicit controls such as matched hyperparameter budgets, baseline GPT variants with identical modifications except the energy term, and statistical significance testing, making it difficult to attribute gains specifically to the energy-based formulation.

minor comments (1)

Notation: the model is introduced as both NRGPT and eNeRgy-GPT without clear acronym expansion or consistent usage; a single defined name would reduce confusion.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address the major comments point by point below, providing clarifications and indicating planned revisions where appropriate.

Referee: Abstract: the central claim that 'we prove... this exploration becomes gradient descent' is load-bearing but supplies no details on the energy function definition, the precise 'certain circumstances,' or how differentiability with respect to discrete token sequences is achieved (e.g., via continuous relaxation or otherwise). This prevents assessment of whether the reduction is exact or merely heuristic.

Authors: We agree that the abstract could benefit from additional specificity to support the central claim. The energy function is defined in Section 2.2 as the negative log-likelihood under the model's predictive distribution, and the circumstances refer to the limit as the exploration temperature parameter τ approaches 0, at which point the stochastic exploration reduces to deterministic gradient descent on a continuous relaxation of the discrete token embeddings (detailed in Section 3.1 and Equation 5). Differentiability is handled through this continuous relaxation, allowing gradients to be computed with respect to the relaxed variables before discretization. We will revise the abstract to briefly outline these elements without exceeding length constraints.

Revision: yes

Referee: Empirical sections: the reported performance on Shakespeare, ListOPS, and OpenWebText lacks explicit controls such as matched hyperparameter budgets, baseline GPT variants with identical modifications except the energy term, and statistical significance testing, making it difficult to attribute gains specifically to the energy-based formulation.

Authors: The referee correctly identifies areas where our empirical evaluation could be strengthened. While we matched hyperparameter budgets across models to the extent possible and used standard training protocols, we did not include explicit ablations isolating the energy term nor report statistical significance across multiple runs. We will add these controls, including comparisons to standard GPT variants with the same architecture but without the energy-based inference, and include error bars or significance tests in the revised manuscript.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper defines NRGPT via a minimal architectural modification that augments GPT with an energy function, then proves that token exploration dynamics reduce to gradient descent only under explicitly stated conditions on the energy landscape and inference process. This reduction is presented as a derived result rather than an input definition, and the work includes independent empirical checks on Shakespeare, ListOPS, and OpenWebText that are not forced by the proof assumptions. No load-bearing step reduces by construction to a fitted parameter or self-citation; the central claim remains falsifiable outside the narrow regime where the equivalence holds. Self-citations to prior EBM or dynamical-systems work are present but do not carry the uniqueness or ansatz burden for the GPT unification.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the existence of a minimal energy function that can be integrated into the GPT forward pass and on the mathematical conditions under which landscape exploration equals gradient descent.

free parameters (1)

Energy function parameters
Parameters defining the energy landscape over token sequences are required to make the unification operational and are likely tuned during training.

axioms (1)

Domain assumption: Inference can be reframed as exploration on an energy landscape with minimal architectural change
This is the core unification premise stated in the proposal.

pith-pipeline@v0.9.0 · 5454 in / 1079 out tokens · 35354 ms · 2026-05-16T21:31:21.541713+00:00 · methodology

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Hyperparameter Transfer for Dense Associative Memories
cs.LG · 2026-05 · unverdicted · novelty 7.0

Explicit scaling prescriptions for hyperparameters in DenseAMs are derived from model dynamics and shown to match empirical results across scales.

Revisiting Transformer Layer Parameterization Through Causal Energy Minimization
cs.LG · 2026-05 · unverdicted · novelty 6.0

CEM recasts Transformer layers as energy minimization steps, enabling constrained parameterizations like weight sharing and low-rank interactions that match standard baselines in 100M-scale language modeling.

Reference graph

Works this paper leans on

8 extracted references · 8 canonical work pages · cited by 2 Pith papers

[1]

NRGPT performs causal language modeling by minimizing a per-token energy. ET was restricted to strict energy minimization of an entire sequence, a paradigm that is not compatible with the parallel, autoregressive language modeling of GPT-style transformers. Published as a conference paper at ICLR 2026.

work page 2026
[2]

NRGPT uses learnable inference rate matrices η during token prediction. Meanwhile, ET was restricted to a fixed, scalar gradient descent step which did not allow additional exploration of the energy landscape

work page
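
To make the contrast between a fixed scalar step and a matrix-valued inference rate concrete, consider a simplified continuous-time picture. This is a sketch under stated assumptions: it ignores the normalization and projection terms that appear in the paper's own derivation, and the symbols $x$, $E$, $\eta_0$, and $D$ are introduced here for illustration only.

```latex
\begin{aligned}
\text{scalar step:}\quad
  & \dot{x} = -\eta_0\,\nabla_x E(x), \qquad \eta_0 > 0,\\
  & \dot{E} = \nabla_x E^{\top}\dot{x} = -\eta_0\,\lVert \nabla_x E \rVert^{2} \le 0,\\[4pt]
\text{matrix rate:}\quad
  & \dot{x} = -\eta\,\nabla_x E(x), \qquad \eta \in \mathbb{R}^{D\times D},\\
  & \dot{E} = -\nabla_x E^{\top}\,\eta\,\nabla_x E
            = -\tfrac{1}{2}\,\nabla_x E^{\top}\left(\eta + \eta^{\top}\right)\nabla_x E .
\end{aligned}
```

In this simplified setting the energy is guaranteed to be non-increasing only when the symmetric part $\eta + \eta^{\top}$ is positive semidefinite. An unconstrained $\eta$ can move uphill along some directions, which is one way to read the phrase "additional exploration of the energy landscape".
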
[3]

NRGPT explores alternative energy-replacements for the feed-forward (FF) MLP module. ET used a single-layer Hopfield Network with energy $G(\xi g_A)$, which results in the weights of the two layers being $\xi$ and $\xi^T$. In NRGPT, we explore a more general form $E_{FF}$ (e.g., Equation (24)) for the feed-forward module and find improved results on the causal language mode...

work page
[4]

Systematic exploration of the solution space. An explicit likelihood function enables us to systematically explore the space of solutions in LLMs using well-established methods in optimization and statistical physics, such as alternative gradient descent methods, minimum-energy paths, saddle-point analysis, and metastability. These tools are not availabl...

work page 2024
[5]

Variable computation using early stopping criteria. Both the energy and the norm of the energy’s gradient can be used as signals to stop model computation “early” for easier problems, or to continue thinking longer for harder ones. This is a primary motivation of other explicit EBM language models like EBT (Gladstone et al., 2025), and we discuss this adv...

work page 2025
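
The early-stopping idea in the entry above can be sketched on toy energies. Everything in this block is an assumption made for illustration: the quadratic energies, the function name `run_until_converged`, and the parameters `eta`, `tol`, and `max_steps` are not taken from the paper. The loop stops as soon as the gradient norm falls below a tolerance, so a well-conditioned ("easy") energy uses fewer update steps than one with a flat direction ("hard").

```python
import math

def run_until_converged(grad, x, eta=0.1, tol=1e-3, max_steps=1000):
    """Iterate x <- x - eta * grad(x) until the gradient norm drops below tol.

    Returns the final state and the number of update steps taken.
    """
    for step in range(max_steps):
        g = grad(x)
        if math.sqrt(sum(gi * gi for gi in g)) < tol:
            return x, step
        x = [xi - eta * gi for xi, gi in zip(x, g)]
    return x, max_steps

# Two hypothetical "problems": gradients of E = 0.5 * sum(c_i * x_i^2).
def easy_grad(x):
    return [1.0 * x[0], 1.0 * x[1]]     # well conditioned

def hard_grad(x):
    return [1.0 * x[0], 0.05 * x[1]]    # one nearly flat direction

_, easy_steps = run_until_converged(easy_grad, [1.0, 1.0])
_, hard_steps = run_until_converged(hard_grad, [1.0, 1.0])
print(f"easy stopped after {easy_steps} steps, hard after {hard_steps} steps")
```

The stopping signal here is the gradient norm. The energy value itself could be used in the same way, for example by stopping once it changes by less than a threshold between steps.
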
[6]

early stopping

Model alignment using energy regularizers. Note that Equation (7) of our paper shows NRGPT’s architecture as the sum of two energies: a token-mixing attention energy $E_{AT}$ and a token-wise feed-forward energy $E_{FF}$. Per the precedent of ET (Hoover et al., 2023), these are chosen to be faithful to the original transformer’s design. However, any scalar objective ...

work page 2023
[7]

FF1, where $E_{FF} = \lVert \sigma(W g) \rVert^2$. In ListOps, we found that FF1 competes with Rec-GPT only when the hidden dimension of $W$ was $8D$ instead of the standard $4D$ used in GPT. This makes FF1 a constant-parameter operation, but because the wider weight matrix is used twice it actually doubles the FLOPS cost of NRGPT

work page
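
The remark that the FF1 weight matrix "is used twice" can be seen by writing out the gradient of the energy with respect to g. The sketch below is illustrative only: it assumes σ = tanh (the source does not state the activation here), uses tiny hand-picked weights, and differentiates with respect to g directly, ignoring any normalization that maps a token state to g. The helper names `matvec`, `ff1_energy`, and `ff1_grad` are hypothetical.

```python
import math

def matvec(W, v):
    """Plain matrix-vector product for a list-of-rows matrix."""
    return [sum(w * vj for w, vj in zip(row, v)) for row in W]

def ff1_energy(W, g):
    """E_FF(g) = || tanh(W g) ||^2  (tanh is an assumed choice of sigma)."""
    return sum(math.tanh(h) ** 2 for h in matvec(W, g))

def ff1_grad(W, g):
    """Analytic gradient: 2 * W^T (sigma'(W g) * sigma(W g)) with sigma = tanh."""
    h = matvec(W, g)                                  # first use of W
    s = [math.tanh(hi) for hi in h]
    u = [2.0 * (1.0 - si * si) * si for si in s]      # 2 * sigma'(h) * sigma(h)
    # Second use of W, this time transposed.
    return [sum(W[i][j] * u[i] for i in range(len(W))) for j in range(len(g))]

# Small hypothetical weights: hidden size 3, token dimension 2.
W = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.7]]
g = [0.8, -0.6]

analytic = ff1_grad(W, g)

# Central finite differences as an independent check of the formula.
eps = 1e-6
numeric = []
for j in range(len(g)):
    gp = list(g); gp[j] += eps
    gm = list(g); gm[j] -= eps
    numeric.append((ff1_energy(W, gp) - ff1_energy(W, gm)) / (2 * eps))

max_err = max(abs(a - n) for a, n in zip(analytic, numeric))
print("analytic gradient:", analytic)
print("max abs error vs finite differences:", max_err)
```

The update needs one product with W and one with its transpose. That matches the observation in the entry above that the FLOP cost doubles even though no parameters are added.
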
[8]

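… $\frac{\partial E_A}{\partial g_B}\,\Gamma P_B\,\eta^{T}\left(\frac{\partial E_B}{\partial g_B}\right)^{T}\Big]$ (32) Note that when $B > A$, $\partial E_A/\partial g_B = 0$. Hence, equation 34 can be separated into $A = B$ and $B < A$: $\dot{E}_A = \sum_{B<A} \mathrm{Tr}\Big[\frac{\partial E_A}{\partial g_B}\frac{\partial g_B}{\partial x_B}\dot{x}_B\Big] - \frac{1}{r_A}\mathrm{Tr}\Big[$ …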

FF2W, where $E_{FF} = g^{T} W_2\,\sigma(W_1 g)$. In this config, the update rule $-\eta\nabla E$ leads to two terms, each being a two-layer neural net. We keep the 4D expansion used by standard transformers and thus the same parameter count, but this leads to ∼2× FLOPS in this module compared to standard transformers. Thus, though NRGPT is more parameter efficient in general, it...

work page arXiv 2026
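
The "two terms" mentioned for FF2W follow from the product rule. The derivation below is a sketch under assumptions: it differentiates with respect to $g$ only (ignoring any normalization that maps a token state to $g$), takes $\sigma$ to act elementwise with derivative $\sigma'$, writes $\odot$ for the elementwise product, and assumes shapes $W_1 \in \mathbb{R}^{4D\times D}$ and $W_2 \in \mathbb{R}^{D\times 4D}$, which are inferred from the "4D expansion" wording rather than stated in the text.

```latex
\begin{aligned}
E_{FF}(g) &= g^{\top} W_2\,\sigma(W_1 g),\\[2pt]
\nabla_g E_{FF}
  &= \underbrace{W_2\,\sigma(W_1 g)}_{\text{two-layer net: } W_1 \text{ then } W_2}
   \;+\;
     \underbrace{W_1^{\top}\big(\sigma'(W_1 g)\odot W_2^{\top} g\big)}_{\text{two-layer net: } W_2^{\top} \text{ then } W_1^{\top}} .
\end{aligned}
```

Each term costs roughly one standard feed-forward pass, so evaluating both is consistent with the ∼2× FLOPS figure quoted above while the parameter count stays that of a single $W_1$, $W_2$ pair.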
