Gumbel Softmax

Jay Zhou

(original paper: *Categorical Reparameterization with Gumbel-Softmax*, Jang et al., 2016)

Gumbel Softmax aims to tackle discrete optimization. This blog contains my understanding of it, as well as the necessary derivations.

REINFORCE

Suppose that there is an underlying distribution $p_\theta(z)$ over a discrete variable $z$. We would like to optimize $\theta$, the parameters of this distribution. Yet the act of sampling $z \sim p_\theta(z)$ precludes the back-propagation of gradients to $\theta$.

Originally, discrete optimization was tackled in RL style through the REINFORCE algorithm, where the gradient of an expectation is converted to the expectation of a gradient.
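For reference, this is the standard score-function (log-derivative) identity, writing $f$ for the downstream objective ($f$ is a symbol I introduce here for illustration):

$$\nabla_\theta\, \mathbb{E}_{z \sim p_\theta(z)}\left[f(z)\right] = \mathbb{E}_{z \sim p_\theta(z)}\left[f(z)\, \nabla_\theta \log p_\theta(z)\right]$$

The right-hand side can be estimated with Monte Carlo samples of $z$, so no gradient needs to flow through the sampling step itself.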

Further problems arise, however, such as the high variance of the gradient estimate (causing convergence issues) and poor sample efficiency at each step.

Gumbel Softmax relieves these issues to a certain extent.

Gumbel-Max

Sampling from an arbitrary discrete distribution can be reparameterized through the Gumbel-Max trick. Suppose that we would like to draw $z$ from its distribution parameterized by $\pi$, i.e. $z \sim \mathrm{Cat}(\pi)$, where $P(z = i) = \pi_i$. Gumbel-Max takes the form of:

$$z = \arg\max_i \left(\log \pi_i + g_i\right), \quad g_i \sim \mathrm{Gumbel}(0, 1)\ \text{i.i.d.}$$

It can be proved that:

$$P\left(\arg\max_j \left(\log \pi_j + g_j\right) = i\right) = \pi_i$$

Merit: the parameters of the underlying distribution now appear explicitly in the sampling process. Moreover, the stochasticity of sampling is transferred to the parameter-free Gumbel noise $g_i$.
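As a sanity check, here is a minimal NumPy sketch of the trick (my own illustration, not from the paper; `pi` and the sample count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
pi = np.array([0.5, 0.3, 0.2])      # categorical parameters, sums to 1
n = 100_000

# Gumbel(0, 1) noise via inverse transform: g = -log(-log(u)), u ~ Uniform(0, 1)
u = rng.uniform(size=(n, len(pi)))
g = -np.log(-np.log(u))

# Gumbel-Max: argmax over log(pi_i) + g_i yields a categorical sample
z = np.argmax(np.log(pi) + g, axis=1)

# Empirical frequencies should be close to pi
print(np.bincount(z, minlength=len(pi)) / n)   # ~[0.5, 0.3, 0.2]
```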

Gumbel Distribution

Let us inspect the distribution of

$$g = -\log(-\log u), \quad u \sim \mathrm{Uniform}(0, 1)$$

Its cumulative distribution function is

$$F(g) = P\left(-\log(-\log u) \le g\right) = P\left(u \le e^{-e^{-g}}\right) = e^{-e^{-g}}$$

Its density function can be derived as

$$f(g) = F'(g) = e^{-g}\, e^{-e^{-g}} = e^{-\left(g + e^{-g}\right)}$$

This is the standard Gumbel distribution, $\mathrm{Gumbel}(0, 1)$.
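A quick empirical check of this derivation (my own sketch; it relies on two standard facts about $\mathrm{Gumbel}(0, 1)$: its mean is the Euler–Mascheroni constant $\gamma \approx 0.5772$, and $F(0) = e^{-1} \approx 0.3679$):

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.uniform(size=1_000_000)
g = -np.log(-np.log(u))      # inverse-transform sampling of Gumbel(0, 1)

print(g.mean())              # ~0.5772, the Euler-Mascheroni constant
print(np.mean(g <= 0.0))     # ~0.3679, matching F(0) = e^{-e^{0}} = e^{-1}
```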

Deriving Gumbel-Max

Denote $\log \pi_i + g_i$ as $x_i$. Then each $x_i$ is an independent Gumbel variable shifted by $\log \pi_i$, with CDF $P(x_i \le x) = e^{-e^{-(x - \log \pi_i)}}$ and density $p(x_i = x) = e^{-(x - \log \pi_i)}\, e^{-e^{-(x - \log \pi_i)}}$.

We have

$$
\begin{aligned}
P\left(\arg\max_j x_j = i\right)
&= \int_{-\infty}^{\infty} p(x_i = x) \prod_{j \neq i} P(x_j \le x)\, dx \\
&= \int_{-\infty}^{\infty} e^{-(x - \log \pi_i)}\, e^{-e^{-(x - \log \pi_i)}} \prod_{j \neq i} e^{-e^{-(x - \log \pi_j)}}\, dx \\
&= \int_{-\infty}^{\infty} \pi_i e^{-x} \exp\Big(-e^{-x} \sum_j \pi_j\Big)\, dx \\
&= \int_{-\infty}^{\infty} \pi_i e^{-x} e^{-e^{-x}}\, dx \qquad \Big(\textstyle\sum_j \pi_j = 1\Big) \\
&= \pi_i \left[e^{-e^{-x}}\right]_{-\infty}^{\infty} = \pi_i
\end{aligned}
$$

Therefore, sampling from an arbitrary categorical distribution can be reparameterized as sampling from the Gumbel distribution.

Putting It Together

Recall that sampling from an arbitrary categorical distribution can be reparameterized as sampling from the Gumbel distribution.

To tackle the problem underlying discrete optimization, we would like to approximate discrete operations with continuous ones, such that gradients flow back to the parameters that the sampled distribution is conditioned on (here, $\pi$).

It is intuitive that by replacing the argmax with a softmax, we obtain a continuous approximation in which gradients with respect to the distribution parameters are well defined. Moreover, by annealing the temperature $\tau$ to 0, the softmax operation approaches the argmax in both sampled value and expectation.

In the computation graph, we can replace the intended sampled values with:

$$y_i = \frac{\exp\left(\left(\log \pi_i + g_i\right) / \tau\right)}{\sum_{j=1}^{k} \exp\left(\left(\log \pi_j + g_j\right) / \tau\right)}, \quad i = 1, \dots, k$$

This approach essentially sidesteps the non-differentiable sampling step by propagating soft, expectation-like relaxed values instead, which approach the hard discrete samples as $\tau \to 0$.
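Below is a minimal PyTorch sketch of this relaxation (my own illustration; the tensor values and temperature are arbitrary). Note that PyTorch also ships a built-in version as `torch.nn.functional.gumbel_softmax`, including a `hard=True` straight-through variant:

```python
import torch

def gumbel_softmax_sample(logits: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """Draw a soft, differentiable sample from Categorical(softmax(logits)).

    As tau -> 0, the output approaches a one-hot argmax sample.
    """
    u = torch.rand_like(logits)
    g = -torch.log(-torch.log(u))              # Gumbel(0, 1) noise
    return torch.softmax((logits + g) / tau, dim=-1)

# log(pi) as learnable logits; gradients flow through the soft sample
pi = torch.tensor([0.5, 0.3, 0.2])
logits = pi.log().detach().requires_grad_()

y = gumbel_softmax_sample(logits, tau=0.5)     # relaxed one-hot vector
y[0].backward()                                # well-defined gradient w.r.t. logits
print(y, logits.grad)
```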

  • Title: Gumbel Softmax
  • Author: Jay Zhou
  • Created at: 2023-06-01 21:18:48
  • Updated at: 2023-06-30 01:07:14
  • Link: https://ja1zhou.github.io/2023/06/01/gumbel-softmax/
  • License: This work is licensed under CC BY-NC-SA 4.0.