Variational Inference and Monte Carlo Sampling are
currently the two chief ways of doing approximate Bayesian inference. In the
Bayesian setting, we typically have some observed variables $x$ and hidden
variables $z$, and we want the posterior distribution over the hidden
variables, $p(z | x)$.

Variational Inference's approximation is made by choosing a family of
distributions $q(z | \lambda)$ parameterized by $\lambda$, then finding the
member of that family closest to the posterior in KL divergence,
$\text{KL}(q(z|\lambda) \,\|\, p(z|x))$.

Looking at this formulation, the first thing you should be thinking is, "We
don't even know how to calculate $p(z|x)$, never mind take an expectation with
respect to it. How can we hope to minimize this KL divergence?" To make
progress, we'll need a further assumption: that $q$ factorizes over the
individual hidden variables, $q(z | \lambda) = \prod_k q(z_k | \lambda_k)$.

This is the "mean field approximation" and will allow us to optimize each
factor $q(z_k | \lambda_k)$ one at a time while holding the rest fixed.

When all 3 conditions are met -- the mean field approximation holds, the conditional posteriors for the individual hidden variables lie in the exponential family, and each variational factor matches the form of its corresponding conditional -- we can apply Coordinate Ascent to minimize the KL divergence between the mean field distribution and the posterior.

# Derivation of the Objective

The original intuition for Variational Inference stems from lower bounding
the marginal likelihood of the observed variables, $\log p(x)$.

First, let's derive a lower bound on the likelihood of the observed variables,

$$
\log p(x)
= \log \int_z p(x, z) \, dz
= \log \int_z q(z) \frac{p(x, z)}{q(z)} \, dz
\ge \int_z q(z) \log \frac{p(x, z)}{q(z)} \, dz
= \mathcal{L}(q)
$$

where the inequality follows from Jensen's inequality applied to the concave $\log$ function.
Since the marginal likelihood decomposes as

$$\log p(x) = \mathcal{L}(q) + \text{KL}(q(z) \,\|\, p(z|x))$$

and $\log p(x)$ does not depend on $q$, the gap between the lower bound and the marginal likelihood is exactly the KL divergence we wanted to minimize.

From this expression, we can see that minimizing the KL divergence over
$q$ is equivalent to maximizing the lower bound $\mathcal{L}(q)$ -- an
objective we can actually work with, since it involves only the joint
distribution $p(x, z)$ rather than the intractable posterior.

At this point, we still have an intractable problem. Even evaluating the KL
divergence requires taking an expectation over all settings for the hidden
variables $z$, of which there may be exponentially many.

# The Mean Field Approximation

The key to avoiding the massive sum of the previous equation is to assume that
$q$ decomposes into a product of independent factors, one per hidden variable:
$q(z | \lambda) = \prod_k q(z_k | \lambda_k)$.

Suppose we make this assumption and that we want to perform coordinate ascent
on a single index $k$ -- that is, maximize $\mathcal{L}$ over $\lambda_k$
while holding $\lambda_{k'}$ fixed for all $k' \ne k$.

At this point, we'll make the assumption that the conditional posterior for each hidden variable lies in the exponential family,

$$p(z_k | z_{-k}, x) = h(z_k) \exp\left( \eta(z_{-k}, x)^T t(z_k) - A(\eta(z_{-k}, x)) \right)$$

and that the matching variational factor has the same form, $q(z_k | \lambda_k) = h(z_k) \exp\left( \lambda_k^T t(z_k) - A(\lambda_k) \right)$.

Here $\eta(\cdot)$ is the natural parameter function, $t(\cdot)$ the
sufficient statistics, $A(\cdot)$ the log-partition function, and $z_{-k}$
denotes all hidden variables except $z_k$.

Plugging this back into the previous equation (we define it to be $\mathcal{L}(\lambda_k)$, the objective restricted to the $k$-th coordinate) and taking the gradient with respect to $\lambda_k$,

$$\nabla_{\lambda_k} \mathcal{L}(\lambda_k) = \nabla_{\lambda_k}^2 A(\lambda_k) \left( \mathbb{E}_{q(z_{-k})}\left[ \eta(z_{-k}, x) \right] - \lambda_k \right)$$

On this last line, we use the property of exponential families that the
expected sufficient statistics equal the gradient of the log-partition
function, $\mathbb{E}_{q(z_k|\lambda_k)}[t(z_k)] = \nabla_{\lambda_k}
A(\lambda_k)$. Setting the gradient to zero gives the coordinate update
$\lambda_k = \mathbb{E}_{q(z_{-k})}[\eta(z_{-k}, x)]$.

So what is this expression? It says that in order to update $\lambda_k$, we
should set it to the expected value of the natural parameter function of
$z_k$'s conditional posterior, where the expectation is taken with respect to
the mean field distribution over all the other hidden variables.
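For intuition, here's a minimal sketch of a full coordinate ascent loop on a toy conjugate model not discussed in this post: a Gaussian with unknown mean $\mu$ and precision $\tau$, with a Normal prior on $\mu$ and a Gamma prior on $\tau$ (the standard mean field treatment of this model; the function and variable names below are my own):

```python
import numpy as np

def cavi_gaussian(x, mu0=0.0, lam0=1.0, a0=1.0, b0=1.0, iters=50):
    """Coordinate ascent VI for x_i ~ Normal(mu, 1/tau) with priors
    mu ~ Normal(mu0, 1/(lam0 * tau)) and tau ~ Gamma(a0, b0).

    Mean field: q(mu, tau) = q(mu) q(tau), with
    q(mu) = Normal(mu_n, 1/lam_n) and q(tau) = Gamma(a_n, b_n).
    """
    n = len(x)
    xbar = np.mean(x)
    # These two quantities are the same at every iteration.
    mu_n = (lam0 * mu0 + n * xbar) / (lam0 + n)
    a_n = a0 + (n + 1) / 2.0
    # Initialize E[tau] from the prior, then iterate coordinate updates.
    e_tau = a0 / b0
    for _ in range(iters):
        lam_n = (lam0 + n) * e_tau                        # update q(mu)
        e_sq = np.sum((x - mu_n) ** 2) + n / lam_n        # E_q[sum (x_i - mu)^2]
        b_n = b0 + 0.5 * (e_sq + lam0 * ((mu_n - mu0) ** 2 + 1.0 / lam_n))
        e_tau = a_n / b_n                                 # update q(tau)
    return mu_n, lam_n, a_n, b_n
```

Each pass updates one factor to the expected natural parameters of its conditional posterior while holding the other fixed, which is exactly the update rule above. With enough data, $q(\mu)$ concentrates near the sample mean and $\mathbb{E}[\tau]$ near the inverse sample variance.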

# Example

For this part, let's take a look at the model defined by Latent Dirichlet Allocation (LDA):

**Input:** document-topic prior $\alpha$, topic-word prior $\beta$

- For each topic $k = 1 \ldots K$
  - Sample topic-word parameters $\phi_{k} \sim \text{Dirichlet}(\beta)$
- For each document $i = 1 \ldots M$
  - Sample document-topic parameters $\theta_i \sim \text{Dirichlet}(\alpha)$
  - For each token $j = 1 \ldots N$
    - Sample topic $z_{i,j} \sim \text{Categorical}(\theta_i)$
    - Sample word $x_{i,j} \sim \text{Categorical}(\phi_{z_{i,j}})$
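To make the generative story concrete, here is a minimal numpy sketch of the sampling process above (function and variable names are mine, not standard):

```python
import numpy as np

def lda_generate(M, N, K, V, alpha, beta, rng):
    """Sample M documents of N tokens each from LDA's generative process."""
    # Topic-word parameters: one V-dimensional Dirichlet draw per topic.
    phi = rng.dirichlet(np.full(V, beta), size=K)
    # Document-topic parameters: one K-dimensional Dirichlet draw per document.
    theta = rng.dirichlet(np.full(K, alpha), size=M)
    z = np.empty((M, N), dtype=int)  # token-level topic assignments
    x = np.empty((M, N), dtype=int)  # observed words
    for i in range(M):
        z[i] = rng.choice(K, size=N, p=theta[i])
        for j in range(N):
            x[i, j] = rng.choice(V, p=phi[z[i, j]])
    return phi, theta, z, x
```

Only `x` is observed; `phi`, `theta`, and `z` are the hidden variables the inference procedure must recover.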

First, a short word on notation. In the following I'll occasionally drop
indices to denote all variables with the same prefix. For example, when I say
$\theta$, I mean all of $\theta_1 \ldots \theta_M$, and when I say $z_i$, I
mean all of $z_{i,1} \ldots z_{i,N}$.
Our goal now is to derive the posterior distribution over the latent
variables given the hyperparameters and the observed variables,
$p(\theta, \phi, z \mid x, \alpha, \beta)$.

**Outline** Deriving the update rules for Variational Inference requires we
do 3 things. First, we must derive the posterior distribution for each hidden
variable given all other variables, hidden and observed. This distribution must
lie in the exponential family, and the corresponding variational distribution for
that variable must be of the same form. For example, if $\theta_i$'s posterior
given all other variables is a Dirichlet distribution, then $q(\theta_i)$ must
be Dirichlet as well.

Second, we need to derive, for each hidden variable, the function that gives us the parameters for the posterior distribution over that variable given all others, hidden and observed.

Finally, we'll need to plug the functions we just derived into an expectation with respect to the mean field distribution. If we are able to calculate this expectation for a particular hidden variable, we can use it to update the matching variational distribution's parameters.

In the following, I'll show you how to derive the update for the variational
distribution of one of the hidden variables in LDA, the document-topic
parameters $\theta_i$.

**Step 1** First, we must show that the posterior distribution over each
individual hidden variable lies in the exponential family. This is not always
the case, but for models that employ conjugate priors, this
can be guaranteed. A conjugate prior dictates that if the prior over a
variable and the likelihood of the data given that variable form a conjugate
pair, then the variable's posterior lies in the same family as its prior.

**Step 2** Next, we derive the parameter function for each hidden variable
as a function of all other variables, hidden and observed. Let's see how this
plays out for the Dirichlet distribution, the conjugate prior of the
Categorical distribution used to sample topics in LDA.

The exponential family form of the Dirichlet distribution is,

$$p(\theta | \alpha) = \exp\left( \sum_{k=1}^{K} (\alpha_k - 1) \log \theta_k - \left( \sum_{k=1}^{K} \log \Gamma(\alpha_k) - \log \Gamma\left( \sum_{k=1}^{K} \alpha_k \right) \right) \right)$$

The exponential family form of a Categorical distribution is,

$$p(z | \theta) = \exp\left( \sum_{k=1}^{K} \mathbb{1}[z = k] \log \theta_k \right)$$

Thus, the posterior distribution for $\theta_i$ given all other variables is proportional to the Dirichlet prior times the Categorical likelihood of the topic assignments, which is itself a Dirichlet distribution,

$$p(\theta_i \mid z, x, \alpha, \beta) = \text{Dirichlet}\left( \alpha_1 + \sum_{j=1}^{N} \mathbb{1}[z_{i,j} = 1], \;\ldots,\; \alpha_K + \sum_{j=1}^{N} \mathbb{1}[z_{i,j} = K] \right)$$

Notice how $\theta_i$'s posterior depends on the topic assignments $z_i$ only
through the counts of tokens assigned to each topic -- these counts are the
parameter function we need for the next step.

**Step 3** Now we need to take the expectation of the parameter
function we just derived with respect to the mean field distribution. For
$\theta_i$, the $k$-th parameter is

$$\mathbb{E}_q\left[ \alpha_k + \sum_{j=1}^{N} \mathbb{1}[z_{i,j} = k] \right] = \alpha_k + \sum_{j=1}^{N} q(z_{i,j} = k)$$

which we can evaluate directly because each $q(z_{i,j})$ is a Categorical
distribution with known parameters.

**Conclusion** We've now derived the update rule for one of the components of
the mean field distribution, $q(\theta_i)$: set its Dirichlet parameters to
$\gamma_{i,k} = \alpha_k + \sum_{j} q(z_{i,j} = k)$. The updates for the
remaining hidden variables, $\phi_k$ and $z_{i,j}$, follow the same three
steps.
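In code, the update we just derived is a one-liner. Here `token_topic_probs[j, k]` stands for the current variational probability $q(z_{i,j} = k)$ for one document $i$ (the array and function names are mine, chosen for this sketch):

```python
import numpy as np

def update_theta_params(alpha, token_topic_probs):
    """Variational Dirichlet parameters for theta_i.

    alpha: (K,) prior parameters.
    token_topic_probs: (N, K) array; row j holds q(z_{i,j} = .).
    Implements gamma_{i,k} = alpha_k + sum_j q(z_{i,j} = k).
    """
    return alpha + token_topic_probs.sum(axis=0)
```

Because the expectation of an indicator is just a probability, the "expected counts" are simply column sums of the token-level variational distributions.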

# Aside: Coordinate Ascent is Gradient Ascent

Coordinate Ascent on the Mean Field Approximation is the "traditional" way one does Variational Inference, but Coordinate Ascent is far from the only optimization method we know. What if we wanted to do Gradient Ascent? What would an update look like then?

It turns out that for the Variational Inference objective, Coordinate Ascent
*is* Gradient Ascent with step size equal to 1. Actually, that's only half true
-- it's Gradient Ascent using a "Natural Gradient" (rather than the usual
gradient, which is defined with respect to Euclidean distance on $\lambda$).

**Gradient Ascent** First, recall the Gradient Ascent update for $\lambda_k$
with step size $\rho$,

$$\lambda_k^{(t+1)} = \lambda_k^{(t)} + \rho \, \nabla_{\lambda_k} \mathcal{L}(\lambda_k^{(t)})$$
**Natural Gradient** Hmm, that $\nabla_{\lambda_k} \mathcal{L}(\lambda_k)$ --
what is it, really? A gradient is a direction of steepest ascent, but steepest
with respect to what notion of distance between parameter settings?

So what do I mean "a direction of steepest ascent"? Let's look at the
gradient of a function as the solution to the following problem as
$\epsilon \to 0$,

$$\nabla_{\lambda} f(\lambda) \propto \arg\max_{d\lambda \,:\, \| d\lambda \|_2^2 \le \epsilon} f(\lambda + d\lambda)$$

A natural gradient with respect to $f$ is defined in the same way, except that
the squared Euclidean distance is replaced by the symmetrized KL divergence
between the distributions the parameters index,

$$\hat{\nabla}_{\lambda} f(\lambda) \propto \arg\max_{d\lambda \,:\, D_{KL}^{\text{sym}}(q(z|\lambda) \,\|\, q(z|\lambda + d\lambda)) \le \epsilon} f(\lambda + d\lambda)$$

Swapping the squared Euclidean metric for the symmetrized KL divergence
changes which perturbations $d\lambda$ count as "small", and therefore which
direction is steepest: distance is now measured between distributions, not
between parameter vectors.

While at first the gradient and natural gradient may seem difficult to
relate, suppose that $d\lambda$ is small. Then the symmetrized KL divergence
is well approximated by a quadratic form whose curvature is the Fisher
information matrix $G(\lambda)$.

As $d\lambda \to 0$,

$$D_{KL}^{\text{sym}}(q(z|\lambda) \,\|\, q(z|\lambda + d\lambda)) \approx d\lambda^T G(\lambda) \, d\lambda, \qquad G(\lambda) = \mathbb{E}_{q(z|\lambda)}\left[ \left( \nabla_{\lambda} \log q(z|\lambda) \right) \left( \nabla_{\lambda} \log q(z|\lambda) \right)^T \right]$$
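We can sanity-check this quadratic approximation numerically on the simplest possible case, a Bernoulli distribution, whose Fisher information in the mean parameterization is $1/(p(1-p))$ (a small illustrative script of my own, not part of the derivation):

```python
import math

def kl_bernoulli(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def sym_kl_bernoulli(p, q):
    """Symmetrized KL divergence: sum of both directions."""
    return kl_bernoulli(p, q) + kl_bernoulli(q, p)

p, d = 0.3, 1e-3
fisher = 1.0 / (p * (1.0 - p))
# For a small perturbation d, the symmetrized KL behaves like d^2 * G(p).
exact = sym_kl_bernoulli(p, p + d)
approx = d * d * fisher
```

The error of the quadratic approximation is third order in $d\lambda$, so shrinking `d` shrinks the relative gap between `exact` and `approx`.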

First, let's take the first-order Taylor approximation to
$\log q(z | \lambda + d\lambda)$ about $\lambda$,
$\log q(z | \lambda + d\lambda) \approx \log q(z | \lambda) + d\lambda^T \nabla_{\lambda} \log q(z | \lambda)$.

Plugging this back into the definition of the symmetrized KL divergence and
keeping terms up to second order in $d\lambda$ yields the quadratic form
$d\lambda^T G(\lambda) \, d\lambda$.

Looking at the expression for the natural gradient, one can show that it is
simply the regular gradient premultiplied by the inverse Fisher information
matrix,

$$\hat{\nabla}_{\lambda} f(\lambda) = G(\lambda)^{-1} \nabla_{\lambda} f(\lambda)$$

Finally, let's define a Gradient Ascent algorithm in terms of the Natural Gradient, rather than the regular gradient,

$$\lambda_k^{(t+1)} = \lambda_k^{(t)} + \rho \, G(\lambda_k^{(t)})^{-1} \nabla_{\lambda_k} \mathcal{L}(\lambda_k^{(t)})$$

Look at that -- for the Variational Inference objective, the natural gradient works out to $\hat{\nabla}_{\lambda_k} \mathcal{L}(\lambda_k) = \mathbb{E}_{q(z_{-k})}[\eta(z_{-k}, x)] - \lambda_k$, so with step size $\rho = 1$ the update sets $\lambda_k^{(t+1)} = \mathbb{E}_{q(z_{-k})}[\eta(z_{-k}, x)]$, which is exactly the Coordinate Ascent update we derived earlier.
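A toy illustration of this "step size 1 jumps straight to the optimum" behavior, using Bernoulli maximum likelihood in the mean parameterization (my own example, not from the derivation above): for empirical mean $m$, the ordinary gradient of the average log-likelihood at $p$ is $(m - p)/(p(1-p))$, and preconditioning by the inverse Fisher information turns it into $m - p$, so one unit step lands exactly on the optimum $p = m$.

```python
def fisher_bernoulli(p):
    # Fisher information of Bernoulli(p) in the mean parameterization.
    return 1.0 / (p * (1.0 - p))

def natural_gradient_step(p, m, step=1.0):
    """One natural gradient ascent step on the Bernoulli average log-likelihood
    for data with empirical mean m."""
    grad = (m - p) / (p * (1.0 - p))       # ordinary gradient
    nat_grad = grad / fisher_bernoulli(p)  # precondition: equals m - p
    return p + step * nat_grad

# Starting far from the optimum, a single step with step size 1
# lands on the maximum-likelihood estimate p = m.
```

The same mechanism is at work in the Variational Inference objective: the Fisher preconditioning cancels the curvature term, leaving an update that sets the parameter directly to its coordinate-wise optimum.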

# Extensions

The Variational Inference method I described here, while general in concept,
can only easily be applied to a very particular class of models -- ones where
each hidden variable's conditional posterior lies in the exponential family,
which in practice usually means conjugate-exponential models.
In addition, we restricted $q(z)$ to the factorized mean field family. Even
after optimization, an individual factor $q(z_k | \lambda_k)$ may *not match the
marginal distribution over $z_k$ at all.* This is a common source of confusion
for first-time users, and makes debugging Variational Inference algorithms
rather difficult.

Third, the Coordinate Ascent algorithm described is not necessarily quick. I explained how Coordinate Ascent is really just Gradient Ascent on the natural gradient, so it's easy to ask what other methods we might be able to apply.

Here are a handful of papers that extend Variational Inference to faster optimization methods, different variational distributions, and non-conjugate models.

"Fast Variational Inference in the Conjugate Exponential Family" -- Conjugate Gradient applied to the Marginalized Variational Bound. Shows that the Marginalized Variational Bound upper bounds the typical Variational Bound and that the former also has better curvature. That means optimizers like Conjugate Gradient can take larger steps and deliver better performance.

"Fixed-Form Variational Posterior Approximation through Stochastic Linear Regression" -- fits a (potentially) non-decomposable exponential family distribution via Linear Regression. Involves looking at KL divergence between unnormalized variational distribution and joint distribution of model, taking derivative with respect to variational distribution's parameters and setting to 0, then solving for the parameters. Can be applied to non-conjugate models due to sampling for estimating expectations.

"Variational Inference in Nonconjugate Models" -- Getting away from conjugate priors via Laplace and the Delta Method.

# References

The seminal work on the Natural Gradient is due to Shun-ichi Amari's "Natural Gradient Works Efficiently in Learning". The derivation for the natural gradient is Theorem 1. Thanks to Alexandre Passos for suggesting this and giving a short-hand intuition of the proof.

The derivation for Variational Inference and the correspondence between Coordinate Ascent and Gradient Ascent is based on the introduction to Matt Hoffman et al.'s "Stochastic Variational Inference".
