The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables

Chris J. Maddison , Andriy Mnih , Yee Whye Teh

classification 💻 cs.LG stat.ML

keywords discrete, random, concrete, stochastic, variables, distribution, gradients, graph

The reparameterization trick enables optimizing large-scale stochastic computation graphs via gradient descent. The essence of the trick is to refactor each stochastic node into a differentiable function of its parameters and a random variable with fixed distribution. After refactoring, the gradients of the loss propagated by the chain rule through the graph are low-variance unbiased estimators of the gradients of the expected loss. While many continuous random variables have such reparameterizations, discrete random variables lack useful reparameterizations due to the discontinuous nature of discrete states. In this work we introduce Concrete random variables: continuous relaxations of discrete random variables. The Concrete distribution is a new family of distributions with closed-form densities and a simple reparameterization. Whenever a discrete stochastic node of a computation graph can be refactored into a one-hot bit representation that is treated continuously, Concrete stochastic nodes can be used with automatic differentiation to produce low-variance biased gradients of objectives (including objectives that depend on the log-probability of latent stochastic nodes) on the corresponding discrete graph. We demonstrate the effectiveness of Concrete relaxations on density estimation and structured prediction tasks using neural networks.
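
As a rough illustration of the "simple reparameterization" the abstract mentions, the sketch below draws a relaxed one-hot sample by adding Gumbel noise to log-weights and applying a tempered softmax. The abstract itself does not spell out the sampler, so this follows the commonly described Gumbel-softmax construction; the function name `sample_concrete`, the parameter names `log_alpha` and `temperature`, and the example weights are illustrative assumptions rather than anything taken from this page.

```python
import numpy as np

def sample_concrete(log_alpha, temperature, rng):
    """Draw one relaxed one-hot sample (a point on the simplex).

    Assumed construction: softmax((log_alpha + Gumbel noise) / temperature).
    The sample is a deterministic, differentiable function of log_alpha
    given the fixed-distribution noise u, which is the reparameterization idea.
    """
    # Uniform noise, kept away from 0 so that log(u) stays finite.
    u = rng.uniform(low=1e-12, high=1.0, size=log_alpha.shape)
    gumbel = -np.log(-np.log(u))           # standard Gumbel(0, 1) noise
    logits = (log_alpha + gumbel) / temperature
    logits = logits - logits.max()         # stabilise the softmax numerically
    weights = np.exp(logits)
    return weights / weights.sum()

rng = np.random.default_rng(0)
log_alpha = np.log(np.array([0.2, 0.3, 0.5]))  # hypothetical class weights
relaxed = sample_concrete(log_alpha, temperature=0.5, rng=rng)
```

Lower temperatures push samples toward the vertices of the simplex, so they look more like one-hot vectors, while higher temperatures give smoother samples. The index of the largest coordinate does not depend on the temperature and is distributed according to the normalised weights.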

Forward citations

Cited by 15 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Policy Optimization in Hybrid Discrete-Continuous Action Spaces via Mixed Gradients
cs.LG 2026-05 unverdicted novelty 7.0

HPO enables unbiased policy optimization in hybrid action spaces by mixing differentiable simulation gradients with score-function estimates, outperforming PPO as continuous dimensions increase.
LeapTS: Rethinking Time Series Forecasting as Adaptive Multi-Horizon Scheduling
cs.LG 2026-05 unverdicted novelty 7.0

LeapTS reformulates forecasting as adaptive multi-horizon scheduling via hierarchical control and NCDEs, delivering at least 7.4% better performance and 2.6-5.3x faster inference than Transformer baselines while adapt...
Approximation-Free Differentiable Oblique Decision Trees
cs.LG 2026-05 unverdicted novelty 7.0

DTSemNet gives an exact, invertible neural-network encoding of hard oblique decision trees that supports direct gradient training for both classification and regression without probabilistic softening or quantized estimators.
Adaptive Selection of LoRA Components in Privacy-Preserving Federated Learning
cs.LG 2026-05 unverdicted novelty 7.0

AS-LoRA adaptively chooses which LoRA factor to update per layer and round using a curvature-aware second-order score, eliminating reconstruction error floors and improving performance in DP federated learning.
Arbitrarily Conditioned Hierarchical Flows for Spatiotemporal Events
cs.LG 2026-05 unverdicted novelty 7.0

ARCH is a hierarchical flow-based generative model that enables tractable conditional intensity computation and arbitrary conditioning for spatiotemporal event distributions.
The Power of Order: Fooling LLMs with Adversarial Table Permutations
cs.LG 2026-05 unverdicted novelty 7.0

Semantically invariant row and column permutations can fool LLMs on tabular tasks, and a new gradient-based attack called ATP finds such permutations to significantly degrade performance across models.
LumiMotion: Improving Gaussian Relighting with Scene Dynamics
cs.CV 2026-04 unverdicted novelty 7.0

LumiMotion improves albedo estimation and scene relighting in dynamic scenes by leveraging motion to separate lighting effects from surface appearance in a dynamic 2D Gaussian Splatting representation.
Learnable Motion-Focused Tokenization for Effective and Efficient Video Unsupervised Domain Adaptation
cs.CV 2026-04 unverdicted novelty 7.0

LMFT enables state-of-the-art performance in video unsupervised domain adaptation by focusing on motion-rich tokens and reducing computational overhead.
In-Context Symbolic Regression for Robustness-Improved Kolmogorov-Arnold Networks
cs.LG 2026-03 unverdicted novelty 7.0

In-context symbolic regression methods improve robustness of symbolic formula recovery from KANs, cutting median OFAT test MSE by up to 99.8 percent across hyperparameter sweeps.
Rethinking the Harmonic Loss via Non-Euclidean Distance Layers
cs.LG 2026-03 unverdicted novelty 7.0

Non-Euclidean distance variants of harmonic loss improve accuracy, gradient stability, and energy efficiency over cross-entropy and Euclidean harmonic loss in vision backbones and large language models.
CapsID: Soft-Routed Variable-Length Semantic IDs for Generative Recommendation
cs.IR 2026-05 unverdicted novelty 6.0

CapsID uses probabilistic capsule routing and confidence-based termination to generate variable-length semantic IDs, improving recall by 9.6% over strong baselines with half the latency of dual-representation systems.
Robust Multimodal Recommendation via Graph Retrieval-Enhanced Modality Completion
cs.IR 2026-05 unverdicted novelty 6.0

GRE-MC retrieves relevant subgraphs and uses a graph transformer plus sparse codebook to complete missing modalities, outperforming prior methods on recommendation benchmarks.
The Power of Order: Fooling LLMs with Adversarial Table Permutations
cs.LG 2026-05 unverdicted novelty 6.0

Semantically invariant row and column permutations in tables can cause LLMs to output incorrect answers, and a gradient-based attack called ATP efficiently finds such permutations that degrade performance across many models.
SWAN: World-Aware Adaptive Multimodal Networks for Runtime Variations
cs.LG 2026-04 unverdicted novelty 6.0

SWAN is the first adaptive multimodal network that meets variable compute budgets, optimizes layer use by sample complexity, and drops irrelevant features, cutting FLOPs up to 49% in 3D object detection with minimal a...
GSQ: Highly-Accurate Low-Precision Scalar Quantization for LLMs via Gumbel-Softmax Sampling
cs.CL 2026-04 unverdicted novelty 6.0

GSQ applies a Gumbel-Softmax relaxation to learn discrete grid assignments in scalar quantization, closing most of the accuracy gap to vector methods like QTIP on Llama-3.1 models at 2-3 bits while using only symmetri...