How to Train Your Energy-Based Models

Yang Song, Diederik P. Kingma

arXiv: 2101.03288 · v2 · pith:4IHUFDEBnew · submitted 2021-01-09 · cs.LG · stat.ML

keywords: models, EBMs, training, approaches, constant, normalizing, energy-based, probabilistic

Abstract

Energy-Based Models (EBMs), also known as non-normalized probabilistic models, specify probability density or mass functions up to an unknown normalizing constant. Unlike most other probabilistic models, EBMs do not place a restriction on the tractability of the normalizing constant, and are thus more flexible to parameterize and can model a more expressive family of probability distributions. However, the unknown normalizing constant of EBMs makes training particularly difficult. Our goal is to provide a friendly introduction to modern approaches for EBM training. We start by explaining maximum likelihood training with Markov chain Monte Carlo (MCMC), and proceed to elaborate on MCMC-free approaches, including Score Matching (SM) and Noise Contrastive Estimation (NCE). We highlight theoretical connections among these three approaches, and end with a brief survey on alternative training methods, which are still under active research. Our tutorial is targeted at an audience with basic understanding of generative models who want to apply EBMs or start a research project in this direction.
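
As a rough illustration of why MCMC fits models that are known only up to a normalizing constant, the sketch below draws samples from a toy EBM using nothing but the gradient of its energy. Everything in it is an assumption made for illustration and is not taken from the abstract: the quadratic energy, the choice of unadjusted Langevin dynamics as the MCMC sampler, the step size, and the function names `energy`, `energy_grad`, and `langevin_sample`.

```python
import numpy as np


def energy(x):
    # Toy energy (assumed for illustration): exp(-energy(x)) is an
    # unnormalized standard Gaussian density.
    return 0.5 * x ** 2


def energy_grad(x):
    # Gradient of the toy energy with respect to x.
    return x


def langevin_sample(n_chains=5000, n_steps=500, step_size=0.01, seed=0):
    # Unadjusted Langevin dynamics: each update uses only the energy
    # gradient, so the normalizing constant is never evaluated.
    rng = np.random.default_rng(seed)
    x = rng.uniform(-3.0, 3.0, size=n_chains)  # arbitrary initialization
    for _ in range(n_steps):
        noise = rng.standard_normal(n_chains)
        x = x - step_size * energy_grad(x) + np.sqrt(2.0 * step_size) * noise
    return x


samples = langevin_sample()
print("sample mean:", samples.mean())
print("sample variance:", samples.var())
```

For this toy energy the sample mean and variance should land near 0 and 1, up to Monte Carlo error and the small bias introduced by the finite step size.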

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Model Merging as Probabilistic Inference in Fine-Tuning Parameter Space
cs.LG 2026-07 unverdicted novelty 7.0

Model merging is cast as PoE inference with EBM experts, revealing Gaussian assumptions in prior work and proposing convergent Cauchy experts that improve empirical performance.
Text Dictates, Music Decorates: Energy-based Attention for Editable Dance Motion Generation
cs.AI 2026-06 unverdicted novelty 7.0

STREAM decouples text and music conditioning in a diffusion transformer, using AdaLN for structure and BEAM for beats, and adds a new Motorica++ dataset and editability metrics, claiming SOTA music alignment with preserved semantics.
Trajectory as the Teacher: Few-Step Discrete Flow Matching via Energy-Navigated Distillation
cs.LG 2026-05 unverdicted novelty 7.0

Energy-navigated trajectory shaping during training produces 8-step discrete flow matching students that achieve 32% lower perplexity than 1024-step teachers on 170M language models with unchanged inference cost.
Energy-based Tissue Manifolds for Longitudinal Multiparametric MRI Analysis
cs.CV 2026-04 unverdicted novelty 7.0

Patient-specific energy manifolds from baseline mpMRI scans act as fixed geometric references to monitor longitudinal evolution of voxel distributions in sequence space for neuro-oncology proof-of-concept cases.
Energy-based Tissue Manifolds for Longitudinal Multiparametric MRI Analysis
cs.CV 2026-04 unverdicted novelty 7.0

A patient-specific energy manifold learned from baseline multiparametric MRI provides a geometric reference system for tracking longitudinal tissue changes through energy and displacement analysis in sequence space.
Stochastic Attention via Langevin Dynamics on the Modern Hopfield Energy
cs.LG 2026-03 unverdicted novelty 7.0

Langevin sampling on the modern Hopfield energy produces training-free stochastic attention that transitions from exact retrieval to generation as temperature rises, with an entropy inflection condition marking the shift.
Contrastive Residual Energy Test-time Adaptation
cs.LG 2025-05 unverdicted novelty 7.0

CreTTA reformulates test-time adaptation of marginal distributions as residual energy learning, producing a contrastive objective that cancels the partition function and uses relative energy differences for adaptive g...
Complexity Analysis of Normalizing Constant Estimation: from Jarzynski Equality to Annealed Importance Sampling and beyond
stat.ML 2025-02 unverdicted novelty 7.0

Derives an Õ(d β² A² / ε⁴) oracle complexity for AIS when estimating the normalizing constant Z to relative error ε, and introduces a reverse diffusion sampler for geometric paths with large action.
Stochastic Interpolants: A Unifying Framework for Flows and Diffusions
cs.LG 2023-03 unverdicted novelty 7.0

Stochastic interpolants unify flow-based and diffusion-based generative models by bridging target densities exactly via latent-variable processes whose drifts minimize quadratic objectives.
SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations
cs.CV 2021-08 conditional novelty 7.0

SDEdit performs guided image synthesis and editing by adding noise to inputs and refining them via denoising with a diffusion model's SDE prior, outperforming GAN methods in human studies without task-specific training.
Revisiting the Volume Hypothesis
cs.LG 2026-06 unverdicted novelty 6.0

The generalization advantage of SGD over random sampling diminishes with growing training set size in binary networks, as measured by joint density of states over train and test accuracy.
Error bounds for simultaneous Wasserstein contractive adaptive increasingly rare MCMC
math.ST 2026-06 unverdicted novelty 6.0

Explicit MSE bounds are derived for time-average estimators in adaptive increasingly rare MCMC under simultaneous Wasserstein contraction.
Decentralized Diffusion Policy Learning for Enhanced Exploration in Cooperative Multi-agent Reinforcement Learning
cs.MA 2026-05 unverdicted novelty 6.0

Decentralized diffusion policies trained with importance sampling score matching enhance exploration and performance in cooperative MARL over Gaussian policy baselines.
Energy Generative Modeling: A Lyapunov-based Energy Matching Perspective
cs.LG 2026-05 unverdicted novelty 6.0

Training and sampling in static scalar energy generative models are two instances of the same Lyapunov-driven density transport dynamics on Wasserstein space, differing only by initial condition, which yields a finite...
Autoregressive Language Models are Secretly Energy-Based Models: Insights into the Lookahead Capabilities of Next-Token Prediction
cs.LG 2025-12 unverdicted novelty 6.0

Autoregressive language models are equivalent to energy-based models through a bijection that corresponds to the soft Bellman equation, explaining their lookahead capabilities despite next-token training.
From Action Labels to Sets: Rethinking Action Supervision for Imitation Learning from Corrective Feedback
cs.RO 2025-02 unverdicted novelty 6.0

CLIC uses set-valued action targets from interactive human corrections instead of pointwise labels to train more robust imitation learning policies.
ERAlign: Energy-based Representation Alignment of GNNs and LLMs on Text-attributed Graphs
cs.LG 2026-06 unverdicted novelty 5.0

ERAlign aligns GNN and LLM embeddings on text-attributed graphs via energy-based models and an Energy Discrepancy objective, reporting state-of-the-art results on eight datasets under varying supervision.
Discovering interpretable low-dimensional dynamics using maximum entropy
q-bio.QM 2026-05 unverdicted novelty 5.0

Edwin integrates dynamic maximum entropy dimensionality reduction with symbolic regression to recover physically interpretable low-dimensional dynamics from high-dimensional observations that generalize to unseen conditions.
The Score-Difference Flow for Implicit Generative Modeling
cs.LG 2023-04 unverdicted novelty 5.0

Score-difference flow reduces KL divergence between distributions and is formally equivalent to denoising diffusion models and a hidden subproblem in optimal GAN training under stated conditions.
Generative AI Meets 6G and Beyond: Diffusion Models for Semantic Communications
eess.SP 2025-11 unverdicted novelty 3.0

The tutorial synthesizes diffusion model techniques for generative semantic communications to achieve high compression while preserving meaning in wireless transmission.