Categorical Reparameterization with Gumbel-Softmax
Recognition: 2 theorem links · Lean theorem
Pith reviewed 2026-05-11 19:57 UTC · model grok-4.3
The pith
The Gumbel-Softmax distribution supplies differentiable samples from categorical distributions, allowing gradient-based training of neural networks with discrete latent variables.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We present an efficient gradient estimator that replaces the non-differentiable sample from a categorical distribution with a differentiable sample from a novel Gumbel-Softmax distribution. This distribution has the essential property that it can be smoothly annealed into a categorical distribution. We show that our Gumbel-Softmax estimator outperforms state-of-the-art gradient estimators on structured output prediction and unsupervised generative modeling tasks with categorical latent variables, and enables large speedups on semi-supervised classification.
What carries the argument
The Gumbel-Softmax distribution, formed by adding Gumbel noise to class logits and applying a softmax with temperature, which reparameterizes categorical sampling to permit differentiation.
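For concreteness, a minimal NumPy sketch of that construction (function and variable names here are illustrative, not taken from the paper's released code):

```python
import numpy as np

def gumbel_softmax_sample(logits, tau, rng):
    """One sample y = softmax((logits + g) / tau) with g ~ Gumbel(0, 1)."""
    u = rng.uniform(low=1e-20, high=1.0, size=logits.shape)
    g = -np.log(-np.log(u))             # inverse-CDF Gumbel(0, 1) noise
    y = (logits + g) / tau
    y -= y.max(axis=-1, keepdims=True)  # numerically stable softmax
    e = np.exp(y)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
logits = np.log(np.array([0.1, 0.2, 0.7]))          # class log-probabilities log(pi)
print(gumbel_softmax_sample(logits, tau=0.1, rng=rng))  # near one-hot at low tau
```

At small τ the samples are nearly one-hot; at large τ they approach uniform over classes, which is exactly the knob the annealing scheme exploits.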
If this is right
- Neural networks containing categorical latent variables can be trained end-to-end with standard backpropagation rather than reinforcement-learning estimators.
- Unsupervised generative models that use discrete latents become directly optimizable, potentially improving density estimation on discrete data.
- Semi-supervised classifiers that rely on categorical latent structure can be trained with substantially fewer gradient steps.
- Structured output prediction problems gain access to gradient signals through the discrete variables that define the output structure.
Where Pith is reading between the lines
- The same relaxation could be applied to other discrete distributions beyond the categorical case, such as Bernoulli or multinomial variables in different contexts.
- In variational inference settings, the estimator might allow tighter bounds or richer posterior approximations when latents must be discrete.
- Downstream applications such as discrete-action reinforcement learning or combinatorial optimization could adopt the same reparameterization without custom gradient derivations.
Load-bearing premise
The Gumbel-Softmax distribution can be smoothly annealed into a categorical distribution while maintaining useful gradient estimates for optimization.
What would settle it
A controlled experiment in which models trained with the annealed Gumbel-Softmax estimator fail to improve on strong baselines for the structured prediction and generative modeling tasks described in the paper.
Original abstract
Categorical variables are a natural choice for representing discrete structure in the world. However, stochastic neural networks rarely use categorical latent variables due to the inability to backpropagate through samples. In this work, we present an efficient gradient estimator that replaces the non-differentiable sample from a categorical distribution with a differentiable sample from a novel Gumbel-Softmax distribution. This distribution has the essential property that it can be smoothly annealed into a categorical distribution. We show that our Gumbel-Softmax estimator outperforms state-of-the-art gradient estimators on structured output prediction and unsupervised generative modeling tasks with categorical latent variables, and enables large speedups on semi-supervised classification.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the Gumbel-Softmax distribution, defined as y = softmax((log π + g)/τ) with g ~ Gumbel(0,1), as a continuous relaxation of categorical sampling that permits backpropagation. By annealing the temperature τ to zero during training, samples converge in distribution to categorical draws. Experiments in Sections 4.1–4.3 demonstrate that this estimator outperforms REINFORCE, NVIL, and straight-through estimators on structured output prediction, categorical VAEs, and semi-supervised classification, while also enabling computational speedups; code is released.
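For reference, the annealing described here can be written as a simple schedule; the exponential form, floor, and update interval below are illustrative assumptions rather than the paper's reported hyperparameters:

```python
import math

def annealed_tau(step, tau0=1.0, tau_min=0.5, rate=1e-4, every=1000):
    """Piecewise-constant exponential decay of the softmax temperature, with a floor."""
    t = (step // every) * every  # update tau only every `every` training steps
    return max(tau_min, tau0 * math.exp(-rate * t))

for step in (0, 5_000, 20_000, 100_000):
    print(step, round(annealed_tau(step), 4))
```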
Significance. If the reported improvements hold under the described annealing schedule, the method removes a major barrier to training neural networks with categorical latent variables, enabling more expressive generative models and structured prediction systems. The direct empirical test of gradient utility during annealing, together with released code, provides a reproducible foundation that strengthens the contribution beyond the central reparameterization trick.
major comments (1)
- [Sections 4.1–4.3] The outperformance claims rest on single-run comparisons without reported standard deviations across random seeds or statistical significance tests; this weakens the conclusion that the estimator is reliably superior to the listed baselines.
minor comments (3)
- [Section 3] The temperature annealing schedule is described but lacks an explicit functional form or pseudocode; adding this would improve reproducibility.
- [Section 4] Figure captions and axis labels in the experimental plots could more clearly indicate the number of samples used per gradient estimate for each method.
- [Abstract] The abstract states 'large speedups' without numerical values; these should be stated explicitly with reference to the relevant table or figure.
Simulated Author's Rebuttal
We thank the referee for the positive assessment of the paper and the recommendation for minor revision. We address the single major comment below.
Point-by-point responses
- Referee: Sections 4.1–4.3: the outperformance claims rest on single-run comparisons without reported standard deviations across random seeds or statistical significance tests; this weakens the conclusion that the estimator is reliably superior to the listed baselines.
  Authors: We agree that single-run results limit the strength of the empirical claims. The original experiments used single runs, primarily because of the computational cost of training the models at the time. In the revised manuscript, we will rerun the experiments from Sections 4.1–4.3 with multiple random seeds (at least 5 per method) and report mean performance together with standard deviations, so that readers can assess the reliability of the observed improvements over REINFORCE, NVIL, and straight-through estimators.
  Revision: yes
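A sketch of the protocol the authors commit to here, with `train_and_evaluate` standing in for an actual training run (the function and its metric are placeholders, not the paper's code):

```python
import numpy as np

def report_across_seeds(train_and_evaluate, n_seeds=5):
    """Train with several random seeds; report mean and sample std of the metric."""
    scores = np.array([train_and_evaluate(seed=s) for s in range(n_seeds)])
    return scores.mean(), scores.std(ddof=1)

# Stand-in objective for illustration only; replace with a real training run.
mean, std = report_across_seeds(lambda seed: float(np.random.default_rng(seed).normal(-60.0, 0.5)))
print(f"ELBO: {mean:.2f} +/- {std:.2f}")
```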
Circularity Check
No significant circularity; derivation is self-contained
full rationale
The paper's core construction defines the Gumbel-Softmax distribution directly from the Gumbel-Max trick (a known external result) as y = softmax((log π + g)/τ) with g ~ Gumbel(0,1), then proves the limit property as τ → 0 mathematically without reducing to fitted inputs or self-citations. The annealing schedule and gradient utility are not derived as predictions but are instead tested empirically in Sections 4.1–4.3 against baselines like REINFORCE. No load-bearing self-citation, self-definitional loop, or renaming of known results occurs; the method is a standard reparameterization extension whose performance claims rest on independent experiments rather than construction.
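In symbols, the two facts this rationale leans on are the Gumbel-Max identity and the zero-temperature limit (standard results; notation follows the summary above):

```latex
% Gumbel-Max trick: exact categorical sampling via an argmax
z = \operatorname{one\_hot}\Big(\operatorname*{arg\,max}_i \big(g_i + \log \pi_i\big)\Big),
\quad g_i \sim \mathrm{Gumbel}(0,1)
\;\Longrightarrow\; \Pr(z_i = 1) = \pi_i.

% Gumbel-Softmax relaxation and its zero-temperature limit
y_i = \frac{\exp\!\big((\log \pi_i + g_i)/\tau\big)}
           {\sum_{j=1}^{k} \exp\!\big((\log \pi_j + g_j)/\tau\big)},
\qquad \lim_{\tau \to 0} y = z.
```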
Axiom & Free-Parameter Ledger
free parameters (1)
- temperature parameter τ of the softmax relaxation (annealed toward zero during training)
axioms (1)
- [standard math] The Gumbel distribution can be used to sample from a categorical distribution via argmax (the Gumbel-Max trick).
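That axiom is easy to check numerically; a quick Monte Carlo sketch (the sample size is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
pi = np.array([0.1, 0.2, 0.7])

# argmax(log pi + Gumbel noise) should pick index i with probability pi[i]
g = rng.gumbel(size=(200_000, pi.size))
idx = (np.log(pi) + g).argmax(axis=1)
print(np.bincount(idx) / len(idx))   # expect approximately [0.1, 0.2, 0.7]
```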
invented entities (1)
- Gumbel-Softmax distribution (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith.Foundation.DAlembert.Inevitability.bilinear_family_forced (unclear)
  Relation between the paper passage and the cited Recognition theorem is ambiguous. Passage: "Gumbel-Softmax estimator outperforms state-of-the-art gradient estimators on structured output prediction and unsupervised generative modeling tasks with categorical latent variables."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 37 Pith papers
- Policy Optimization in Hybrid Discrete-Continuous Action Spaces via Mixed Gradients
  HPO enables unbiased policy optimization in hybrid action spaces by mixing differentiable simulation gradients with score-function estimates, outperforming PPO as continuous dimensions increase.
- Test-time Sparsity for Extreme Fast Action Diffusion
  Test-time sparsity with a parallel pipeline and omnidirectional feature reuse accelerates action diffusion by 5x to 47.5 Hz while cutting FLOPs 92% with no performance loss.
- Marginal multi-object multi-frame blind deconvolution
  A marginal estimator for blind deconvolution in solar imaging that integrates out object uncertainty to improve regularization and allow plug-and-play hyperparameter optimization.
- Structural Interpretations of Protein Language Model Representations via Differentiable Graph Partitioning
  SoftBlobGIN combines ESM-2 representations with protein contact graphs via a lightweight GNN and differentiable substructure pooling to achieve 92.8% accuracy on enzyme classification, raise binding-site AUROC to 0.98...
- Approximation-Free Differentiable Oblique Decision Trees
  DTSemNet gives an exact, invertible neural-network encoding of hard oblique decision trees that supports direct gradient training for both classification and regression without probabilistic softening or quantized estimators.
- PhySPRING: Structure-Preserving Reduction of Physics-Informed Twins via GNN
  PhySPRING uses differentiable GNNs to learn hierarchical coarsened spring-mass topologies and parameters from observations, delivering up to 2.3x speedup on PhysTwin benchmarks and comparable robot policy success rate...
- Adaptive Selection of LoRA Components in Privacy-Preserving Federated Learning
  AS-LoRA adaptively chooses which LoRA factor to update per layer and round using a curvature-aware second-order score, eliminating reconstruction error floors and improving performance in DP federated learning.
- Arbitrarily Conditioned Hierarchical Flows for Spatiotemporal Events
  ARCH is a hierarchical flow-based generative model that enables tractable conditional intensity computation and arbitrary conditioning for spatiotemporal event distributions.
- The Power of Order: Fooling LLMs with Adversarial Table Permutations
  Semantically invariant row and column permutations can fool LLMs on tabular tasks, and a new gradient-based attack called ATP finds such permutations to significantly degrade performance across models.
- Optimal sensor placement for the reconstruction of ocean states using differentiable Gumbel-Softmax sampling operator
  A Gumbel-Softmax-based differentiable optimization framework for sensor placement halves reconstruction RMSE for sea surface height with only 0.1% observations versus random sampling.
- Differentiable Satellite Constellation Configuration via Relaxed Coverage and Revisit Objectives
  Four continuous relaxations turn non-differentiable coverage and revisit calculations into a fully differentiable pipeline that optimizes satellite orbits via gradients and outperforms metaheuristics.
- SecureRouter: Encrypted Routing for Efficient Secure Inference
  SecureRouter accelerates secure transformer inference by 1.95x via an encrypted router that selects input-adaptive models from an MPC-optimized pool with negligible accuracy loss.
- VisPCO: Visual Token Pruning Configuration Optimization via Budget-Aware Pareto-Frontier Learning for Vision-Language Models
  VisPCO uses continuous relaxation, straight-through estimators, and budget-aware Pareto-frontier learning to automatically discover optimal visual token pruning configurations that approximate grid-search results acro...
- Self-Supervised Foundation Model for Calcium-imaging Population Dynamics
  CalM uses a discrete tokenizer and dual-axis autoregressive transformer pretrained self-supervised on calcium traces to outperform specialized baselines on population dynamics forecasting and adapt to superior behavio...
- SToRe3D: Sparse Token Relevance in ViTs for Efficient Multi-View 3D Object Detection
  SToRe3D delivers up to 3x faster inference for multi-view 3D object detection in ViTs by selecting relevant 2D tokens and 3D queries via mutual relevance heads with only marginal accuracy loss.
- NexOP: Joint Optimization of NEX-Aware k-space Sampling and Image Reconstruction for Low-Field MRI
  NexOP jointly optimizes NEX-aware k-space sampling probabilities and multi-measurement reconstruction to raise effective SNR in low-field MRI under a fixed total sampling budget.
- SURGE: Surrogate Gradient Adaptation in Binary Neural Networks
  SURGE proposes a dual-path gradient compensator and adaptive scaler to learn better surrogate gradients for binary neural network training, outperforming prior methods on classification, detection, and language tasks.
- Continuous Latent Diffusion Language Model
  Cola DLM proposes a hierarchical latent diffusion model that learns a text-to-latent mapping, fits a global semantic prior in continuous space with a block-causal DiT, and performs conditional decoding, establishing l...
- CapsID: Soft-Routed Variable-Length Semantic IDs for Generative Recommendation
  CapsID uses probabilistic capsule routing and confidence-based termination to generate variable-length semantic IDs, improving recall by 9.6% over strong baselines with half the latency of dual-representation systems.
- Adaptive Dual-Path Framework for Covert Semantic Communication
  An adaptive dual-path framework for covert semantic communication achieves near-random attacker detection of 56.12% on Cityscapes while outperforming baselines on primary semantic tasks.
- Learning to Theorize the World from Observation
  NEO induces compositional latent programs as world theories from observations and executes them to enable explanation-driven generalization.
- Triple Spectral Fusion for Sensor-based Human Activity Recognition
  A triple spectral fusion method using adaptive filtering in three domains improves human activity recognition from inertial sensors on benchmark datasets.
- The Power of Order: Fooling LLMs with Adversarial Table Permutations
  Semantically invariant row and column permutations in tables can cause LLMs to output incorrect answers, and a gradient-based attack called ATP efficiently finds such permutations that degrade performance across many models.
- Semi-Markov Reinforcement Learning for City-Scale EV Ride-Hailing with Feasibility-Guaranteed Actions
  A robust semi-Markov RL agent with MILP feasibility projection and Wasserstein ambiguity set achieves $1.22M net profit on an NYC EV simulator with zero feeder violations, outperforming heuristic and other RL baselines.
- GSQ: Highly-Accurate Low-Precision Scalar Quantization for LLMs via Gumbel-Softmax Sampling
  GSQ applies a Gumbel-Softmax relaxation to learn discrete grid assignments in scalar quantization, closing most of the accuracy gap to vector methods like QTIP on Llama-3.1 models at 2-3 bits while using only symmetri...
- LEPO: Latent Reasoning Policy Optimization for Large Language Models
  LEPO applies RL to stochastic latent representations in LLMs via Gumbel-Softmax to support diverse reasoning paths and unified optimization.
- Scene-Agnostic Object-Centric Representation Learning for 3D Gaussian Splatting
  A scene-agnostic object codebook learned via unsupervised object-centric learning provides consistent identity-anchored representations for 3D Gaussians across multiple scenes.
- From Universal to Individualized Actionability: Revisiting Personalization in Algorithmic Recourse
  Formalizing personalization as individual actionability in causal recourse shows hard constraints degrade validity and plausibility while revealing socio-demographic disparities in costs.
- HuggingFace's Transformers: State-of-the-art Natural Language Processing
  Hugging Face releases an open-source Python library that supplies a unified API and pretrained weights for major Transformer architectures used in natural language processing.
- LEPO: Latent Reasoning Policy Optimization for Large Language Models
  LEPO applies RL to continuous latent representations in LLMs by injecting Gumbel-Softmax stochasticity for diverse trajectory sampling and unified gradient estimation, outperforming existing discrete and latent RL methods.
- SGP-SAM: Self-Gated Prompting for Transferring 3D Segment Anything Models to Lesion Segmentation
  SGP-SAM transfers 3D SAM to lesion segmentation using a self-gated module for conditional multi-scale enhancement and a Zoom Loss, achieving 7.3% mDice gain over fine-tuning on MSD Liver Tumor data.
- Revisiting Token Compression for Accelerating ViT-based Sparse Multi-View 3D Object Detectors
  SEPatch3D accelerates ViT-based 3D object detectors up to 57% faster than StreamPETR via dynamic patch sizing and cross-granularity enhancement while keeping comparable accuracy on nuScenes and Argoverse 2.
- Flux Attention: Context-Aware Hybrid Attention for Efficient LLMs Inference
  Flux Attention uses a context-aware Layer Router to dynamically assign full or sparse attention to each LLM layer, achieving up to 2.8x prefill and 2.0x decode speedups with competitive performance on long-context and...
- MO-RiskVAE: A Multi-Omics Variational Autoencoder for Survival Risk Modeling in Multiple Myeloma
  Moderate relaxation of KL regularization and hybrid continuous-discrete latent spaces improve survival discrimination in multi-omics VAEs for multiple myeloma.
- Structure-Augmented Standard Plane Detection with Temporal Aggregation in Blind-Sweep Fetal Ultrasound
  Structure augmentation via segmentation prior plus temporal aggregation stabilizes keyframe detection of fetal abdomen planes in blind-sweep ultrasound.
- HY-World 2.0: A Multi-Modal World Model for Reconstructing, Generating, and Simulating 3D Worlds
  HY-World 2.0 generates and reconstructs high-fidelity navigable 3D Gaussian Splatting worlds from text, images, or videos via upgraded panorama, planning, expansion, and composition modules, with released code claimin...
- Multiple Domain Generalization Using Category Information Independent of Domain Differences
  A domain generalization method extracts domain-independent category features and applies SQ-VAE to bridge gaps, yielding accuracy gains on vascular and cell nucleus segmentation datasets.
Reference graph
Works this paper leans on
- [1] Y. Bengio, N. Léonard, and A. Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
- [2] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. InfoGAN: interpretable representation learning by information maximizing generative adversarial nets. CoRR, abs/1606.03657, 2016.
- [3] J. Chung, S. Ahn, and Y. Bengio. Hierarchical multiscale recurrent neural networks. arXiv preprint arXiv:1609.01704, 2016.
- [4] A. Graves, G. Wayne, and I. Danihelka. Neural Turing machines. CoRR, abs/1410.5401, 2014.
- [5] K. Gregor, I. Danihelka, A. Mnih, C. Blundell, and D. Wierstra. Deep autoregressive networks. arXiv preprint arXiv:1310.8499, 2013.
- [6] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
- [7] A. Mnih and D. J. Rezende. Variational inference for Monte Carlo objectives. arXiv preprint arXiv:1602.06725, 2016.
- [8]
- [9] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.
- [10] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the Inception architecture for computer vision. arXiv preprint arXiv:1512.00567, 2015.
- [11]
- [12] D. P. Kingma, S. Mohamed, D. J. Rezende, and M. Welling. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, 2014.
- [13]
- [14]
discussion (0)