Autoregressive Language Models are Secretly Energy-Based Models: Insights into the Lookahead Capabilities of Next-Token Prediction

Germain Vivier-Ardisson; Mathieu Blondel; Michael E. Sander; Tianlin Liu; Vincent Roulet

arxiv: 2512.15605 · v4 · pith:NZLN6I2Xnew · submitted 2025-12-17 · 💻 cs.LG · stat.ML

Pith reviewed 2026-05-16 21:25 UTC · model grok-4.3

keywords: autoregressive models, energy-based models, chain rule, bijection, soft Bellman equation, maximum entropy RL, next-token prediction, lookahead

The pith

Autoregressive language models are equivalent to energy-based models in function space through a bijection from the chain rule of probability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that autoregressive models, trained only on next-token prediction, match energy-based models that score entire sequences by energy. This match follows directly from factoring joint probabilities into conditionals via the chain rule. A reader would care because the equivalence explains how next-token training can still produce planning behavior that aligns with optimal policies in maximum-entropy reinforcement learning. The same link also equates supervised training of the two model classes and supplies error bounds when distilling an energy-based model into an autoregressive one.

Core claim

Taking the chain rule of probability as the starting point yields an explicit bijection between autoregressive models and energy-based models in function space; this bijection is a special case of the soft Bellman equation from maximum-entropy reinforcement learning. Supervised learning on next-token prediction is therefore equivalent to learning the corresponding energy-based model, and theoretical error bounds exist for distilling energy-based models into autoregressive ones. The result supplies a concrete account of why next-token predictors can exhibit lookahead despite their local training objective.
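As an illustration of the chain-rule direction of this claim, the sketch below builds a toy autoregressive model as a table of random next-token distributions, defines the sequence energy E(x) = −∑_t log p(x_t | x_<t), and checks that the resulting energy-based model is already normalized. The vocabulary size, sequence length, and the `arm` table are assumptions chosen for the example; they do not come from the paper.

```python
import itertools
import numpy as np

V, T = 3, 3  # toy vocabulary size and sequence length (illustrative values)
rng = np.random.default_rng(0)

# Hypothetical toy ARM: one next-token distribution per prefix of length 0..T-1.
arm = {}
for t in range(T):
    for prefix in itertools.product(range(V), repeat=t):
        logits = rng.normal(size=V)
        arm[prefix] = np.exp(logits - np.logaddexp.reduce(logits))

def energy(seq):
    """Sequence energy E(x) = -sum_t log p(x_t | x_<t)."""
    return -sum(np.log(arm[seq[:t]][seq[t]]) for t in range(T))

sequences = list(itertools.product(range(V), repeat=T))
energies = np.array([energy(s) for s in sequences])

# Partition function of the induced EBM p(x) ∝ exp(-E(x)).
Z = np.exp(-energies).sum()

# Marginalizing the EBM over later tokens gives the first-token distribution.
first_token = np.zeros(V)
for s, e in zip(sequences, energies):
    first_token[s[0]] += np.exp(-e)
```

Here `Z` comes out equal to 1 up to floating-point error, and `first_token` matches `arm[()]`: the energy built from the conditionals encodes the same joint distribution. This is a sanity check on one random instance, not a proof.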

What carries the argument

The explicit bijection in function space that maps the autoregressive chain-rule factorization to an energy function, shown to be identical to a special case of the soft Bellman equation.
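The reverse direction can be sketched numerically. Starting from an arbitrary sequence-level reward (a negative energy), the code below applies the recursion q(s, y) = r(s, y) + V_q(s ⊕ y) backward from the end of the sequence and compares the resulting next-token model with the brute-force energy-based model. Several choices here are assumptions made for the example rather than details taken from the paper: the soft value is taken to be V_q(s) = log ∑_y exp q(s, y), as is standard in maximum-entropy RL; V_q is set to 0 at complete sequences; and the whole reward is paid on the final token.

```python
import itertools
import numpy as np

V, T = 3, 3  # toy vocabulary size and sequence length (illustrative values)
rng = np.random.default_rng(1)

sequences = list(itertools.product(range(V), repeat=T))
# Hypothetical toy EBM: an arbitrary reward R(y) for each complete sequence.
reward = {s: rng.normal() for s in sequences}

def soft_value(prefix):
    """V_q(s) = log sum_y exp q(s, y); assumed to be 0 at complete sequences."""
    if len(prefix) == T:
        return 0.0
    return np.logaddexp.reduce(q_values(prefix))

def q_values(prefix):
    """q(s, y) = r(s, y) + V_q(s ⊕ y), with the reward paid on the last token only."""
    out = np.empty(V)
    for y in range(V):
        nxt = prefix + (y,)
        r = reward[nxt] if len(nxt) == T else 0.0
        out[y] = r + soft_value(nxt)
    return out

def arm_log_prob(seq):
    """Chain rule with next-token logits q: sum_t [q(s_t, y_t) - V_q(s_t)]."""
    return sum(q_values(seq[:t])[seq[t]] - soft_value(seq[:t]) for t in range(T))

# Brute-force EBM: log p(y) = R(y) - log Z.
log_Z = np.logaddexp.reduce(np.array([reward[s] for s in sequences]))
ebm_log_probs = np.array([reward[s] - log_Z for s in sequences])
arm_log_probs = np.array([arm_log_prob(s) for s in sequences])
max_gap = np.max(np.abs(arm_log_probs - ebm_log_probs))
```

On this toy instance `max_gap` is at floating-point level and `soft_value(())` equals `log_Z`. The first-token logits therefore already sum over every possible continuation, which is one way to read the lookahead claim. The recursion enumerates all V^T sequences, so it is only practical at toy sizes.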

If this is right

Supervised learning of autoregressive models is formally identical to supervised learning of the corresponding energy-based models.
Distillation of an energy-based model into an autoregressive model admits explicit theoretical error bounds.
Next-token prediction can recover the same lookahead behavior as optimal policies in maximum-entropy reinforcement learning.
The soft-Bellman correspondence lets any analysis of one model class transfer directly to the other.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Techniques developed for sampling from energy-based models, such as MCMC, could be repurposed to improve autoregressive decoding on tasks that require global consistency.
Alignment methods that optimize energy-based objectives may be applied to autoregressive models without architectural change.
The unification suggests that chain-of-thought prompting in language models works by implicitly minimizing an energy over future tokens.

Load-bearing premise

The chain rule of probability directly produces the claimed bijection in function space with no further restrictions on model capacity, training dynamics, or the form of the energy function.

What would settle it

A simple finite vocabulary and short sequence length where the next-token probabilities of a trained autoregressive model cannot be rearranged into a sequence-level energy function that satisfies the soft Bellman optimality condition.

Figures

Figures reproduced from arXiv: 2512.15605 by Germain Vivier-Ardisson, Mathieu Blondel, Michael E. Sander, Tianlin Liu, Vincent Roulet.

**Figure 1.** Summary of mappings discussed in this paper.

**Figure 2.** Empirical validation of Proposition 2. Left: Minimizing the expected risk of an ARM and an EBM parameterized by causal and non-causal Transformers, respectively. Right: L∞ distance between the logits of the trained ARM and the logits of the optimal EBM, before and after applying the mapping M. Our results confirm that the EBM and ARM converge to the same minima, as predicted by Proposition 2. Perhaps more …

**Figure 3.** Loss convergence and logits distances for different Transformer sizes.

**Figure 4.** Loss convergence and logits distances for different Transformer sizes in the case T > V.

**Figure 5.** Comparing KL divergence and logits distance in infinity norm.

Original abstract

Autoregressive models (ARMs) currently constitute the dominant paradigm for large language models (LLMs). Energy-based models (EBMs) represent another class of models, which have historically been less prevalent in LLM development, yet naturally characterize the optimal policy in post-training alignment. In this paper, we provide a unified view of these two model classes. Taking the chain rule of probability as a starting point, we establish an explicit bijection between ARMs and EBMs in function space, which we show to correspond to a special case of the soft Bellman equation in maximum entropy reinforcement learning. Building upon this bijection, we derive the equivalence between supervised learning of ARMs and EBMs. Furthermore, we analyze the distillation of EBMs into ARMs by providing theoretical error bounds. Our results provide insights into the ability of ARMs to plan ahead, despite being based on the next-token prediction paradigm.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows that any autoregressive factorization defines an energy function E(x) = -sum log p(x_t | x_<t) that satisfies a special case of the soft Bellman equation, giving a direct equivalence between ARMs and EBMs.

The core contribution is the explicit bijection in function space: start from the chain rule, set the energy to the negative sum of conditional log probabilities, and the resulting model matches the optimal policy under the soft Bellman equation from max-ent RL. This is straightforward once written down, but the authors spell out the mapping cleanly and then use it to equate supervised ARM training with EBM training. They also give error bounds on distilling an EBM into an ARM, which is the most concrete technical step. The lookahead insight follows from the fact that the energy already encodes the full joint, so next-token training implicitly optimizes a global objective. That part is mostly interpretive rather than a new derivation. The math appears to hold without extra assumptions on capacity or training dynamics, which keeps the result general but also means it applies to any autoregressive model by construction. No circularity or hidden restrictions show up in the argument. The paper is aimed at researchers who already think about RLHF, alignment, or EBMs for language models. It gives a compact way to move between the two formalisms and could be cited when discussing why next-token models sometimes exhibit planning behavior. I would send it to peer review; the unification is clean enough to be worth referee time even if the practical payoff is modest.

Referee Report

2 major / 3 minor

Summary. The paper claims to establish an explicit bijection between autoregressive language models (ARMs) and energy-based models (EBMs) in function space by starting from the chain rule of probability, showing that this bijection corresponds to a special case of the soft Bellman equation from maximum-entropy reinforcement learning. It derives the equivalence of supervised learning objectives for ARMs and EBMs, provides theoretical error bounds on distilling EBMs into ARMs, and uses the framework to explain the lookahead capabilities of next-token prediction.

Significance. If the bijection and derivations are correct, the work supplies a clean theoretical unification of the dominant LLM paradigm with EBMs that are already known to characterize optimal policies under alignment objectives. The explicit RL link offers a principled explanation for why next-token ARMs can exhibit planning behavior, and the distillation error bounds are directly usable for practical model compression. The derivation is parameter-free and rests only on the chain rule plus the standard soft Bellman equation, which are strengths.

major comments (2)

[§3] §3 (bijection derivation): the manuscript must explicitly verify that the mapping E(x) = −∑_t log p(x_t | x_<t) is bijective in function space for arbitrary joint distributions and that the resulting energy satisfies the soft Bellman equation without hidden restrictions on model capacity or the form of the energy function; this step is load-bearing for all subsequent claims.
[§4] §4 (error bounds): the stated theoretical error bounds for EBM-to-ARM distillation are central to the practical contribution; the proof should be expanded to show the precise dependence on the number of distillation steps and any assumptions on the proposal distribution.

minor comments (3)

[Notation] Notation for the energy function and the soft Bellman operator should be introduced once in §2 and used consistently thereafter to avoid reader confusion.
[Abstract] The abstract asserts 'explicit bijection' and 'theoretical error bounds' but does not preview the key equations; adding one or two displayed equations in the abstract or introduction would improve accessibility.
[Discussion] A short discussion of how the bijection behaves under finite-capacity neural-network parameterizations (as opposed to the infinite-capacity function-space case) would strengthen the bridge to practical LLMs.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive evaluation and the recommendation of minor revision. The comments help clarify the presentation of the core bijection and strengthen the error bounds. We address each major comment below.

Referee: [§3] §3 (bijection derivation): the manuscript must explicitly verify that the mapping E(x) = −∑_t log p(x_t | x_<t) is bijective in function space for arbitrary joint distributions and that the resulting energy satisfies the soft Bellman equation without hidden restrictions on model capacity or the form of the energy function; this step is load-bearing for all subsequent claims.

Authors: We agree that an explicit verification strengthens the load-bearing step. In the revised manuscript we will insert a short lemma in §3 proving bijectivity in function space: given any joint distribution p over finite-length sequences, the chain rule yields a unique E(x) = −log p(x) = −∑_t log p(x_t | x_<t); conversely, any real-valued energy E induces a unique normalized p(x) ∝ exp(−E(x)) whose autoregressive factorization recovers the original conditionals. The same E satisfies the soft Bellman equation E(x) = −log ∑_{x'} exp(−E(x')) (up to additive constants independent of x) by direct substitution of the normalization, with no restrictions on model capacity or energy functional form required—the identity holds pointwise for arbitrary positive measures. revision: yes
Referee: [§4] §4 (error bounds): the stated theoretical error bounds for EBM-to-ARM distillation are central to the practical contribution; the proof should be expanded to show the precise dependence on the number of distillation steps and any assumptions on the proposal distribution.

Authors: We thank the referee for this request. In the revision we will expand the proof of Theorem 4 to derive the explicit dependence of the total variation (or KL) error on the number of distillation steps K, obtaining a contraction of the form O(ρ^K) where ρ < 1 depends on the temperature and the minimal probability mass of the proposal. We will also state the standing assumptions on the proposal distribution q (full support over the sequence space and finite second moments) that are used to control the variance of the importance weights. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained from chain rule

The paper starts from the standard chain rule of probability to define an explicit bijection in function space between the autoregressive factorization p(x) = ∏ p(x_t | x_<t) and an energy function E(x) = -∑ log p(x_t | x_<t). This is shown to satisfy a special case of the soft Bellman equation by sequential decomposition of the log-probability, which is a direct algebraic identity holding for any joint distribution. No step renames a fitted parameter as a prediction, imports uniqueness via self-citation, or defines the target result in terms of itself. The RL link follows from established maximum-entropy RL without requiring the present paper's result as an assumption. The central claim therefore remains independent of its own outputs.
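The "sequential decomposition of the log-probability" mentioned above can be written out as a short telescoping argument. The block below is a sketch under assumed notation, not the paper's own derivation: $s_{t+1} = s_t \oplus y_t$, the soft value is taken to be the log-sum-exp of the logits $q$, and $V_q$ is set to zero at complete sequences.

```latex
\begin{align*}
q(s_t, y_t) &= r(s_t, y_t) + V_q(s_t \oplus y_t),
  \qquad V_q(s) := \log \sum_{y} \exp q(s, y), \\
\log p(y_t \mid s_t) &= q(s_t, y_t) - V_q(s_t)
  = r(s_t, y_t) + V_q(s_{t+1}) - V_q(s_t), \\
\sum_{t=1}^{T} \log p(y_t \mid s_t)
  &= \sum_{t=1}^{T} r(s_t, y_t) + V_q(s_{T+1}) - V_q(s_1)
  = \sum_{t=1}^{T} r(s_t, y_t) - V_q(s_1).
\end{align*}
```

Under these assumptions the sequence-level energy is $E(y) = -\sum_t r(s_t, y_t)$ and $V_q(s_1)$ plays the role of the log-partition function, so the product of next-token conditionals and the normalized energy-based model describe the same distribution.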

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on two standard background results with no free parameters or newly invented entities.

axioms (2)

[standard math] Chain rule of probability
Used as the explicit starting point to construct the bijection between ARMs and EBMs.
[domain assumption] Soft Bellman equation in maximum-entropy RL
The bijection is asserted to be a special case of this established equation.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "Taking the chain rule of probability as a starting point, we establish an explicit bijection between ARMs and EBMs in function space, which we show to correspond to a special case of the soft Bellman equation"

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "q(s_t, y_t) := r(s_t, y_t) + V_q(s_t ⊕ y_t) (recursive mapping M)"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

LoopUS: Recasting Pretrained LLMs into Looped Latent Refinement Models
cs.LG · 2026-05 · unverdicted · novelty 7.0
LoopUS converts pretrained LLMs into looped latent refinement models via block decomposition, selective gating, random deep supervision, and confidence-based early exiting to improve reasoning performance.

Ontology-Constrained Neural Reasoning in Enterprise Agentic Systems: A Neurosymbolic Architecture for Domain-Grounded AI Agents
cs.AI · 2026-04 · unverdicted · novelty 5.0
A three-layer ontology framework for grounding enterprise LLM agents yields statistically significant gains in accuracy and role consistency, with larger benefits in domains where the base models have weak parametric ...

Ontology-Constrained Neural Reasoning in Enterprise Agentic Systems: A Neurosymbolic Architecture for Domain-Grounded AI Agents
cs.AI · 2026-04 · unverdicted · novelty 4.0
Ontology grounding improves accuracy and role consistency of enterprise LLM agents, with larger gains in domains poorly covered by training data.