Introduction

These notes give a fairly systematic walkthrough of several variational inference methods. If you repost, please credit the source:
https://blog.nowcoder.net/n/c0e560ac1acb42f9a88b525da39b9189
Reference: Deep Bayes

Full Bayesian inference

Training stage:
$$p(\theta \mid X_{tr}, Y_{tr}) = \frac{p(Y_{tr} \mid X_{tr}, \theta)\, p(\theta)}{\int p(Y_{tr} \mid X_{tr}, \theta)\, p(\theta)\, d\theta}$$
Testing stage:
$$p(y \mid x, X_{tr}, Y_{tr}) = \int p(y \mid x, \theta)\, p(\theta \mid X_{tr}, Y_{tr})\, d\theta$$
Comment: The denominator in the training stage may be intractable; posterior distributions can be computed analytically only for simple conjugate models.
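To make the conjugate case concrete, here is a minimal sketch of exact Bayesian inference in a Beta-Bernoulli model; the data and prior hyperparameters are made up for illustration.

```python
# Minimal sketch of exact Bayesian inference in a conjugate model
# (Beta prior + Bernoulli likelihood); all numbers are illustrative.
import numpy as np

# Illustrative training data: coin flips
y_tr = np.array([1, 0, 1, 1, 0, 1, 1, 1])

# Beta(a0, b0) prior over the success probability theta
a0, b0 = 1.0, 1.0

# Training stage: posterior is Beta(a0 + #heads, b0 + #tails), no integral needed
a_post = a0 + y_tr.sum()
b_post = b0 + len(y_tr) - y_tr.sum()

# Testing stage: predictive p(y = 1 | data) = E[theta | data], also in closed form
p_next_head = a_post / (a_post + b_post)
print(p_next_head)
```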

Approximate inference

Probabilistic model:
$$p(x, \theta) = p(x \mid \theta)\, p(\theta)$$
Variational inference:
Approximate the true posterior with a tractable distribution: $p(\theta \mid x) \approx q(\theta)$
Biased, but fast and more scalable
MCMC:
Draw samples from the unnormalized posterior $p(x, \theta) \propto p(\theta \mid x)$
Unbiased, but needs a lot of samples
Some mathematical magic:
$$\log p(x) = \int q(\theta) \log \frac{p(x, \theta)}{q(\theta)}\, d\theta + \int q(\theta) \log \frac{q(\theta)}{p(\theta \mid x)}\, d\theta = \mathcal{L}(q) + KL\big(q(\theta) \,\|\, p(\theta \mid x)\big)$$
The first term is the ELBO (evidence lower bound).
The second term is the KL (Kullback-Leibler) divergence between $q(\theta)$ and the true posterior.
Variational Inference: ELBO interpretation
Since $\log p(x)$ does not depend on $q$, maximizing the ELBO is equivalent to minimizing the KL divergence.
Final optimization problem:
$$\mathcal{L}(q) = \int q(\theta) \log p(x \mid \theta)\, d\theta - KL\big(q(\theta) \,\|\, p(\theta)\big) \to \max_{q(\theta)}$$
The first term is the data term, the second is a regularizer.
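As a sanity check of the decomposition above, the following sketch evaluates both sides on a toy model where everything is tractable; the prior, likelihood, observation, and the Gaussian $q$ below are my own illustrative choices.

```python
# Hedged numeric check of log p(x) = ELBO + KL on a fully tractable toy model:
# prior p(theta) = N(0, 1), likelihood p(x | theta) = N(theta, 1),
# one observation x, Gaussian approximation q(theta) = N(m, s2).
import numpy as np

x = 1.0
m, s2 = 0.3, 0.7          # arbitrary variational mean and variance

# Evidence: p(x) = N(x | 0, 2)
log_evidence = -0.5 * np.log(2 * np.pi * 2.0) - x ** 2 / (2 * 2.0)

# ELBO = E_q[log p(x|theta)] + E_q[log p(theta)] + entropy of q
e_loglik = -0.5 * np.log(2 * np.pi) - 0.5 * ((x - m) ** 2 + s2)
e_logprior = -0.5 * np.log(2 * np.pi) - 0.5 * (m ** 2 + s2)
entropy_q = 0.5 * np.log(2 * np.pi * np.e * s2)
elbo = e_loglik + e_logprior + entropy_q

# True posterior p(theta | x) = N(x/2, 1/2); Gaussian KL(q || posterior)
mp, sp2 = x / 2.0, 0.5
kl = 0.5 * (np.log(sp2 / s2) + (s2 + (m - mp) ** 2) / sp2 - 1.0)

print(elbo + kl, log_evidence)   # the two numbers agree
```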
Mean field approximation
Restrict $q$ to fully factorized distributions:
$$q(\theta) = \prod_{j=1}^{m} q_j(\theta_j), \qquad \mathcal{L}(q) = \int \prod_j q_j(\theta_j) \log \frac{p(x, \theta)}{\prod_j q_j(\theta_j)}\, d\theta \to \max_{q_1, \dots, q_m}$$
To optimize with respect to a single factor $q_k$, we can use the following replacement to reformulate the objective:
$$r_k(\theta_k) = \frac{1}{Z_k} \exp\Big( \mathbb{E}_{q_{j \neq k}} \log p(x, \theta) \Big)$$
So the objective (up to a constant w.r.t. $q_k$) becomes:
$$\mathcal{L}(q) = -KL\big(q_k(\theta_k) \,\|\, r_k(\theta_k)\big) + \text{const} \to \max_{q_k} \quad\Rightarrow\quad q_k(\theta_k) = r_k(\theta_k)$$
Algorithm
Initialize
$$q(\theta) = \prod_{j=1}^{m} q_j(\theta_j)$$
Iterations:
Update each factor $q_k(\theta_k) = \frac{1}{Z_k} \exp\big( \mathbb{E}_{q_{j \neq k}} \log p(x, \theta) \big)$ in turn until the ELBO converges.
Parametric optimization
Alternatively, restrict $q$ to a parametric family $q(\theta \mid \lambda)$ and maximize the ELBO with respect to $\lambda$:
$$\mathcal{L}\big(q(\theta \mid \lambda)\big) \to \max_{\lambda}$$
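A minimal sketch of the coordinate-ascent algorithm above on the classic toy case of a correlated bivariate Gaussian posterior with a fully factorized $q$; the mean and precision values are my own illustrative choices.

```python
# Hedged sketch: coordinate-ascent mean-field updates for a toy bivariate
# Gaussian "posterior" with mean mu and precision matrix Lam (illustrative).
import numpy as np

mu = np.array([0.0, 0.0])                 # true posterior mean
Lam = np.array([[2.0, 1.2], [1.2, 2.0]])  # true posterior precision

# Mean-field q(theta) = q1(theta1) q2(theta2), each factor Gaussian.
m = np.array([5.0, -5.0])                 # arbitrary initialization of factor means

for _ in range(20):
    # Update each factor in turn: q_k is Gaussian with fixed variance 1/Lam[k,k]
    # and mean that depends on the current estimate of the other factor.
    m[0] = mu[0] - (Lam[0, 1] / Lam[0, 0]) * (m[1] - mu[1])
    m[1] = mu[1] - (Lam[1, 0] / Lam[1, 1]) * (m[0] - mu[0])

print(m)                               # converges to the true posterior mean (0, 0)
print(1 / np.diag(Lam))                # mean-field factor variances: [0.5, 0.5]
print(np.diag(np.linalg.inv(Lam)))     # true marginal variances are larger (~0.78)
```

The factor means converge to the true posterior mean, while the factor variances underestimate the true marginal variances, a well-known property of mean-field approximations.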

Inference Summary

[Summary figure: exact inference for conjugate models; mean-field or parametric variational inference, or MCMC, otherwise.]

Statistical Inference

A model with continuous latent variables can be regarded as a mixture of a continuum of distributions:
$$p(x \mid \theta) = \int p(x \mid z, \theta)\, p(z)\, dz$$
The E-step can be done in closed form only in the case of conjugate distributions; otherwise the true posterior is intractable:
$$q(z_i) = p(z_i \mid x_i, \theta) = \frac{p(x_i \mid z_i, \theta)\, p(z_i)}{\int p(x_i \mid z_i, \theta)\, p(z_i)\, dz_i}$$
Typically continuous latent variables are used for dimension reduction, also known as representation learning.
Example: PCA model
Consider $x \in \mathbb{R}^D$ and $z \in \mathbb{R}^d$ such that $D \gg d$.
Joint distribution:
$$p(x, z \mid \theta) = p(x \mid z, \theta)\, p(z) = \mathcal{N}\big(x \mid Vz + \mu, \sigma^2 I\big)\, \mathcal{N}\big(z \mid 0, I\big)$$
Here $\theta$ consists of a $D \times d$ matrix $V$, a $D$-dimensional vector $\mu$, and a scalar $\sigma^2$.
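A hedged sketch of this model with my own toy dimensions: sample from the generative process and compute the standard closed-form posterior $p(z \mid x) = \mathcal{N}\big(M^{-1}V^{\top}(x - \mu),\ \sigma^2 M^{-1}\big)$ with $M = V^{\top}V + \sigma^2 I$, which is what makes the E-step tractable for this linear Gaussian model.

```python
# Hedged sketch of probabilistic PCA: sample x = V z + mu + noise and compute
# the closed-form posterior p(z | x). Dimensions and parameters are illustrative.
import numpy as np

rng = np.random.default_rng(0)
D, d = 10, 2                       # observed and latent dimensionality, D >> d
V = rng.normal(size=(D, d))        # loading matrix
mu = rng.normal(size=D)            # mean vector
sigma2 = 0.1                       # noise variance

# Generative process: z ~ N(0, I_d), x | z ~ N(V z + mu, sigma2 I_D)
z = rng.normal(size=d)
x = V @ z + mu + np.sqrt(sigma2) * rng.normal(size=D)

# Closed-form posterior p(z | x) = N(M^{-1} V^T (x - mu), sigma2 M^{-1})
M = V.T @ V + sigma2 * np.eye(d)
z_mean = np.linalg.solve(M, V.T @ (x - mu))
z_cov = sigma2 * np.linalg.inv(M)
print(z_mean, z)   # posterior mean is close to the latent that generated x
```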
EM-PCA and Mixture of PCA
The same EM scheme applies when each data point additionally carries a discrete component index $t_i$, giving a mixture of PPCA components.
Joint distribution:
$$p(X, Z, T \mid \theta) = \prod_{i=1}^{N} \mathcal{N}\big(x_i \mid V_{t_i} z_i + \mu_{t_i}, \sigma^2 I\big)\, \mathcal{N}\big(z_i \mid 0, I\big)\, \pi_{t_i}$$
Variational autoencoder
$$p(x, z \mid \theta) = \prod_{i=1}^{N} p(x_i \mid z_i, \theta)\, p(z_i), \qquad p(x_i \mid z_i, \theta) = \mathcal{N}\big(x_i \mid \mu_\theta(z_i), \sigma^2_\theta(z_i)\big)$$
where $\mu_\theta(\cdot)$ and $\sigma^2_\theta(\cdot)$ are neural networks (the decoder) with parameters $\theta$.
EM for VAE
E-step:
$$q(z_i) = p(z_i \mid x_i, \theta) = \frac{p(x_i \mid z_i, \theta)\, p(z_i)}{\int p(x_i \mid z_i, \theta)\, p(z_i)\, dz_i}$$
However, the denominator is still intractable, because the likelihood is parameterized by a neural network and is no longer conjugate to the prior.
Variational inference
Parametric variational inference
Instead of directly inferring $p(z_i \mid x_i, \theta)$, let us define a flexible variational approximation:
$$q(z_i \mid x_i, \varphi) = \mathcal{N}\big(z_i \mid \mu_\varphi(x_i), \operatorname{diag}(\sigma^2_\varphi(x_i))\big)$$
where $\mu_\varphi(\cdot)$ and $\sigma^2_\varphi(\cdot)$ are neural networks (the encoder) with parameters $\varphi$.
This additional neural network ensures tractability of the distribution while keeping it very flexible.
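A minimal PyTorch sketch of such an encoder network; the class name, layer sizes, and dimensions below are my own illustrative choices, not a reference implementation.

```python
# Hedged sketch: an encoder that maps x to the parameters (mean, log-variance)
# of the Gaussian variational approximation q(z | x, phi).
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, x_dim=784, h_dim=200, z_dim=20):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.mu = nn.Linear(h_dim, z_dim)        # mean of q(z | x, phi)
        self.log_var = nn.Linear(h_dim, z_dim)   # log-variance of q(z | x, phi)

    def forward(self, x):
        h = self.hidden(x)
        return self.mu(h), self.log_var(h)

enc = Encoder()
x = torch.randn(16, 784)          # illustrative mini-batch
mu, log_var = enc(x)
print(mu.shape, log_var.shape)    # (16, 20) each: one Gaussian per data point
```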
Stochastic optimization
$$\mathcal{L}(\theta, \varphi) = \sum_{i=1}^{N} \Big[ \mathbb{E}_{q(z_i \mid x_i, \varphi)} \log p(x_i \mid z_i, \theta) - KL\big(q(z_i \mid x_i, \varphi) \,\|\, p(z_i)\big) \Big] \to \max_{\theta, \varphi}$$
Problem 1: The training set is assumed to be large, so full-data iterations are expensive.
Problem 2: The expectation in the ELBO is still intractable.
Solution: Compute stochastic gradients using mini-batching and Monte Carlo estimation.
Optimization w.r.t. $\theta$
Only the first term of the ELBO depends on $\theta$:
$$\nabla_\theta \mathcal{L} = \nabla_\theta \sum_{i=1}^{N} \mathbb{E}_{q(z_i \mid x_i, \varphi)} \log p(x_i \mid z_i, \theta)$$
Mini-batching
Approximate the sum over the full data set with a random mini-batch of size $n$:
$$\nabla_\theta \sum_{i=1}^{N} \mathbb{E}_{q(z_i \mid x_i, \varphi)} \log p(x_i \mid z_i, \theta) \approx \frac{N}{n} \nabla_\theta \sum_{k=1}^{n} \mathbb{E}_{q(z_{i_k} \mid x_{i_k}, \varphi)} \log p(x_{i_k} \mid z_{i_k}, \theta)$$
Moreover, since the expectation is taken w.r.t. a distribution that does not depend on $\theta$, we can move the gradient inside and use a single-sample Monte Carlo estimate:
$$\nabla_\theta \mathbb{E}_{q(z_i \mid x_i, \varphi)} \log p(x_i \mid z_i, \theta) = \mathbb{E}_{q(z_i \mid x_i, \varphi)} \nabla_\theta \log p(x_i \mid z_i, \theta) \approx \nabla_\theta \log p(x_i \mid \hat{z}_i, \theta), \quad \hat{z}_i \sim q(z_i \mid x_i, \varphi)$$
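A hedged toy check of this fact, using a made-up model $q(z) = \mathcal{N}(0, 1)$ and $\log p(x \mid z, \theta) = \log \mathcal{N}(x \mid \theta z, 1)$, for which the exact gradient is $-\theta$.

```python
# Hedged toy check that the gradient w.r.t. theta can be moved inside the
# expectation and estimated by Monte Carlo. Model and numbers are made up.
import numpy as np

rng = np.random.default_rng(0)
x, theta = 1.0, 0.7

def grad_log_lik(z):
    # d/dtheta log N(x | theta z, 1) = z * (x - theta * z)
    return z * (x - theta * z)

# Exact gradient of E_q[log p(x|z,theta)]: x*E[z] - theta*E[z^2] = -theta
exact = -theta

# Average of single-sample unbiased estimates over many repetitions
samples = rng.normal(size=100_000)
print(exact, grad_log_lik(samples).mean())
```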
However, when it comes to $\varphi$, the situation is different: the distribution we take the expectation over depends on $\varphi$, so we can no longer move the gradient inside the integral:
$$\nabla_\varphi \mathbb{E}_{q(z_i \mid x_i, \varphi)} \log p(x_i \mid z_i, \theta) = \int \nabla_\varphi q(z_i \mid x_i, \varphi)\, \log p(x_i \mid z_i, \theta)\, dz_i$$
which is no longer an expectation that can be estimated by sampling from $q$.
Log-derivative trick
$$\nabla_\varphi q(z \mid x, \varphi) = q(z \mid x, \varphi)\, \nabla_\varphi \log q(z \mid x, \varphi)$$
Applying the trick yields:
$$\int \nabla_\varphi q(z \mid x, \varphi)\, f(z)\, dz = \int q(z \mid x, \varphi)\, \nabla_\varphi \log q(z \mid x, \varphi)\, f(z)\, dz = \mathbb{E}_{q(z \mid x, \varphi)}\big[ f(z)\, \nabla_\varphi \log q(z \mid x, \varphi) \big]$$
Then the expectation can be estimated using Monte Carlo methods.
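A minimal sketch of the resulting score-function estimator on a made-up example, $q(z \mid \varphi) = \mathcal{N}(\varphi, 1)$ and $f(z) = z^2$, where the exact gradient is $2\varphi$.

```python
# Hedged sketch of the log-derivative (score-function) estimator for
# d/dphi E_{q(z|phi)}[f(z)] with q(z|phi) = N(phi, 1) and f(z) = z^2 (made up).
import numpy as np

rng = np.random.default_rng(0)
phi = 1.5
z = rng.normal(loc=phi, scale=1.0, size=100_000)

f = z ** 2                      # the integrand f(z)
score = z - phi                 # grad_phi log N(z | phi, 1)
estimate = (f * score).mean()   # Monte Carlo estimate of the gradient

print(estimate, 2 * phi)        # exact value: d/dphi (phi^2 + 1) = 2 phi
```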
Log-derivative trick for ELBO
$$\nabla_\varphi \sum_{i=1}^{N} \Big[ \mathbb{E}_{q(z_i \mid x_i, \varphi)} \log p(x_i \mid z_i, \theta) - KL\big(q(z_i \mid x_i, \varphi) \,\|\, p(z_i)\big) \Big]$$
The KL term between two Gaussians is available in closed form, so now consider the first term and apply mini-batching and the log-derivative trick:
$$\nabla_\varphi \mathbb{E}_{q(z_i \mid x_i, \varphi)} \log p(x_i \mid z_i, \theta) = \mathbb{E}_{q(z_i \mid x_i, \varphi)}\big[ \log p(x_i \mid z_i, \theta)\, \nabla_\varphi \log q(z_i \mid x_i, \varphi) \big] \approx \log p(x_i \mid \hat{z}_i, \theta)\, \nabla_\varphi \log q(\hat{z}_i \mid x_i, \varphi), \quad \hat{z}_i \sim q(z_i \mid x_i, \varphi)$$
We can prove that the score function $\nabla_\varphi \log q(z \mid x, \varphi)$ is zero mean.
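A one-line argument, assuming we may interchange differentiation and integration:
$$\mathbb{E}_{q(z \mid x, \varphi)} \nabla_\varphi \log q(z \mid x, \varphi) = \int q(z \mid x, \varphi)\, \frac{\nabla_\varphi q(z \mid x, \varphi)}{q(z \mid x, \varphi)}\, dz = \nabla_\varphi \int q(z \mid x, \varphi)\, dz = \nabla_\varphi 1 = 0$$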
REINFORCE
$$\nabla_\varphi \mathbb{E}_{q(z \mid \varphi)} f(z) = \mathbb{E}_{q(z \mid \varphi)}\big[ f(z)\, \nabla_\varphi \log q(z \mid \varphi) \big] \approx f(\hat{z})\, \nabla_\varphi \log q(\hat{z} \mid \varphi), \quad \hat{z} \sim q(z \mid \varphi)$$
However, the term $\log p(x_i \mid \hat{z}_i, \theta)$ can be arbitrarily large in magnitude, which leads to very unstable stochastic gradients.
A partial solution is to use baselines.
Consider a function $b(x_i)$ that does not depend on $z_i$, such that:
$$\mathbb{E}_{q(z_i \mid x_i, \varphi)}\Big[ \big(\log p(x_i \mid z_i, \theta) - b(x_i)\big)\, \nabla_\varphi \log q(z_i \mid x_i, \varphi) \Big] = \mathbb{E}_{q(z_i \mid x_i, \varphi)}\Big[ \log p(x_i \mid z_i, \theta)\, \nabla_\varphi \log q(z_i \mid x_i, \varphi) \Big]$$
Remember that the zero-mean property of the score function guarantees this equality for any such baseline, so the estimator stays unbiased:
$$\nabla_\varphi \mathbb{E}_{q(z_i \mid x_i, \varphi)} \log p(x_i \mid z_i, \theta) \approx \big(\log p(x_i \mid \hat{z}_i, \theta) - b(x_i)\big)\, \nabla_\varphi \log q(\hat{z}_i \mid x_i, \varphi), \quad \hat{z}_i \sim q(z_i \mid x_i, \varphi)$$
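A hedged sketch on the same made-up toy problem as before, with a large constant added to $f$ to mimic a log-likelihood of large magnitude; subtracting a baseline leaves the mean of the estimator unchanged but reduces its variance dramatically.

```python
# Hedged sketch: score-function (REINFORCE) estimator with and without a
# constant baseline b. Model q(z|phi) = N(phi, 1) and f(z) = z^2 + 100 are made up;
# the +100 offset mimics an arbitrarily large term in log p(x | z, theta).
import numpy as np

rng = np.random.default_rng(0)
phi = 1.5
z = rng.normal(loc=phi, scale=1.0, size=100_000)

f = z ** 2 + 100.0               # plays the role of log p(x | z, theta)
score = z - phi                  # grad_phi log N(z | phi, 1)
b = f.mean()                     # a simple constant baseline (illustrative)

plain = f * score                # REINFORCE estimator
with_baseline = (f - b) * score  # baseline-corrected estimator

print(plain.mean(), with_baseline.mean())   # both approx. 2*phi = 3.0
print(plain.var(), with_baseline.var())     # variance is dramatically smaller with b
```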


I am a lazy man.


Reparameterization trick

Consider differentiation of a complex expectation:
$$\nabla_\varphi \int q(z \mid x, \varphi)\, f(z)\, dz$$
Express $z$ as a deterministic function $z = g(\varepsilon, x, \varphi)$ of a random variable $\varepsilon \sim p(\varepsilon)$ and perform the change-of-variables rule:
$$\int q(z \mid x, \varphi)\, f(z)\, dz = \int p(\varepsilon)\, f\big(g(\varepsilon, x, \varphi)\big)\, d\varepsilon$$
Then stochastic differentiation is simply:
$$\nabla_\varphi \int p(\varepsilon)\, f\big(g(\varepsilon, x, \varphi)\big)\, d\varepsilon = \mathbb{E}_{p(\varepsilon)} \nabla_\varphi f\big(g(\varepsilon, x, \varphi)\big) \approx \nabla_\varphi f\big(g(\hat{\varepsilon}, x, \varphi)\big), \quad \hat{\varepsilon} \sim p(\varepsilon)$$
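For a Gaussian $q(z \mid x, \varphi) = \mathcal{N}\big(z \mid \mu_\varphi(x), \sigma^2_\varphi(x)\big)$ one can take $g(\varepsilon, x, \varphi) = \mu_\varphi(x) + \sigma_\varphi(x)\,\varepsilon$ with $\varepsilon \sim \mathcal{N}(0, I)$. Below is a hedged sketch on the same made-up toy problem as above, comparing the reparameterized estimator with the score-function one.

```python
# Hedged sketch of the reparameterization trick on the toy problem
# q(z|phi) = N(phi, 1), f(z) = z^2, with z = g(eps, phi) = phi + eps, eps ~ N(0, 1).
import numpy as np

rng = np.random.default_rng(0)
phi = 1.5
eps = rng.normal(size=100_000)
z = phi + eps                      # z = g(eps, phi): deterministic in phi

# grad_phi f(g(eps, phi)) = f'(z) * dg/dphi = 2 z * 1
reparam = 2 * z
score = (z ** 2) * (z - phi)       # score-function estimator for comparison

print(reparam.mean(), score.mean())   # both approx. 2*phi = 3.0
print(reparam.var(), score.var())     # reparameterized gradient has far lower variance
```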
I got lazy again~~



Conclusion

Good Good Study, Day Day Up