A Tutorial on Variational Bayes


A Tutorial on Variational Bayes
Junhao Hua (华俊豪)
Laboratory of Machine and Biological Intelligence,
Department of Information Science & Electronic Engineering,
Zhejiang University
2014/3/27
Email: [email protected]
For further information, see: http://www.huajh7.com

Outline
- Motivation
- The Variational Bayesian Framework
  - Variational Free Energy
  - Optimization Techniques: Mean Field Approximation
  - Exponential Family
  - Bayesian Networks
- Example: VB for Mixture Models
- Discussion
- Applications
- References

A Problem: How to Learn from Data?
Typically, we use a complex statistical model, but how do we learn its parameters and latent variables?
- Data: $X$
- Model: $P(X \mid \theta, Z)$

Challenge
- Maximum likelihood: overfits the data; model complexity; computational tractability.
- Bayesian framework: gives rise to intractable integrals, namely the partition function and the posterior over unobserved variables.
- Approximate inference:
  - Monte Carlo sampling, e.g. MCMC, particle filters.
  - Variational Bayes.

Outline (recap): The Variational Bayesian Framework

Variational Free Energy
- Basic idea: "conditional independence is enforced as a functional constraint in the approximating distribution, and the best such approximation is found by minimization of a Kullback-Leibler divergence (KLD)."
- Use a simpler variational distribution $Q(Z)$ to approximate the true posterior $P(Z \mid D)$.
- Two equivalent views, linked by $\log P(D) = \mathcal{L}(Q) + D_{KL}(Q \,\|\, P)$:
  - Minimize the (reverse) Kullback-Leibler divergence
    $$D_{KL}(Q \,\|\, P) = \sum_Z Q(Z) \log \frac{Q(Z)}{P(Z \mid D)} = \sum_Z Q(Z) \log \frac{Q(Z)}{P(Z, D)} + \log P(D)$$
  - Maximize the variational free energy (a lower bound on $\log P(D)$)
    $$\mathcal{L}(Q) = \sum_Z Q(Z) \log P(Z, D) - \sum_Z Q(Z) \log Q(Z) = E_Q[\log P(Z, D)] + H(Q)$$
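To make the two views concrete, here is a minimal numerical sketch (not from the slides; the toy joint table is made up) for a single discrete latent variable. It checks that $\log P(D) = \mathcal{L}(Q) + D_{KL}(Q \,\|\, P(Z \mid D))$, so maximizing the free energy is the same as minimizing the reverse KL.

```python
import numpy as np

# Toy joint P(Z, D=d) over a discrete latent Z with 3 states (values are assumptions).
p_joint = np.array([0.30, 0.15, 0.05])          # P(Z=k, D=d)
p_data = p_joint.sum()                           # P(D=d)
p_post = p_joint / p_data                        # true posterior P(Z | D=d)

# An arbitrary variational distribution Q(Z).
q = np.array([0.5, 0.3, 0.2])

kl = np.sum(q * np.log(q / p_post))                                   # D_KL(Q || P(Z|D))
free_energy = np.sum(q * np.log(p_joint)) - np.sum(q * np.log(q))     # E_Q[log P(Z,D)] + H(Q)

# log P(D) = L(Q) + KL(Q || P): the bound is tight iff Q equals the true posterior.
assert np.isclose(np.log(p_data), free_energy + kl)
print(free_energy, kl, np.log(p_data))
```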

Optimization Techniques: Mean Field Approximation
- Originated in the statistical physics literature.
- Conditional independence assumption.
- Decoupling: replace an intractable distribution with a product of tractable marginal distributions (a tractable subgraph).
- Factorization: $Q(Z) = \prod_{i=1}^{M} q(Z_i \mid D)$

Optimization Techniques: Variational Methods
- Optimization problem: maximize the lower bound
  $$\mathcal{L}(Q(Z)) = E_{Q(Z)}[\ln P(Z, D)] + H(Q(Z)),$$
  where $Q(Z) = \prod_i Q_i(Z_i)$, subject to the normalization constraints $\forall i.\ \int Q_i(Z_i)\, dZ_i = 1$.
- Seek the extremum of a functional: the Euler-Lagrange equation.

Derivation
- Consider the partition $Z = \{Z_i, Z_{-i}\}$, where $Z_{-i} = Z \setminus Z_i$.
- Consider the energy term:
  $$\begin{aligned}
  E_{Q(Z)}[\ln P(Z, D)] &= \int \Big(\prod_j Q_j(Z_j)\Big) \ln P(Z, D)\, dZ \\
  &= \int Q_i(Z_i) \Big[\int Q_{-i}(Z_{-i}) \ln P(Z, D)\, dZ_{-i}\Big]\, dZ_i \\
  &= \int Q_i(Z_i)\, \langle \ln P(Z, D) \rangle_{Q_{-i}(Z_{-i})}\, dZ_i \\
  &= \int Q_i(Z_i) \ln \exp\big(\langle \ln P(Z, D) \rangle_{Q_{-i}(Z_{-i})}\big)\, dZ_i \\
  &= \int Q_i(Z_i) \ln Q_i^*(Z_i)\, dZ_i + \ln C
  \end{aligned}$$
- We define $Q_i^*(Z_i) = \frac{1}{C} \exp\big(\langle \ln P(Z, D) \rangle_{Q_{-i}(Z_{-i})}\big)$, where $C$ is a normalization constant.

Derivation (cont.)
- Consider the entropy:
  $$\begin{aligned}
  H(Q(Z)) &= -\int \Big(\prod_k Q_k(Z_k)\Big) \sum_i \ln Q_i(Z_i)\, dZ \\
  &= -\sum_i \int Q_i(Z_i) \ln Q_i(Z_i)\, dZ_i \\
  &= -\int Q_i(Z_i) \ln Q_i(Z_i)\, dZ_i - \sum_{k \neq i} \int Q_k(Z_k) \ln Q_k(Z_k)\, dZ_k
  \end{aligned}$$
- Then we get the functional:
  $$\begin{aligned}
  \mathcal{L}(Q(Z)) &= \int Q_i(Z_i) \ln Q_i^*(Z_i)\, dZ_i - \int Q_i(Z_i) \ln Q_i(Z_i)\, dZ_i - \sum_{k \neq i} \int Q_k(Z_k) \ln Q_k(Z_k)\, dZ_k + \ln C \\
  &= \int Q_i(Z_i) \ln \frac{Q_i^*(Z_i)}{Q_i(Z_i)}\, dZ_i - \sum_{k \neq i} \int Q_k(Z_k) \ln Q_k(Z_k)\, dZ_k + \ln C \\
  &= -D_{KL}\big(Q_i(Z_i) \,\|\, Q_i^*(Z_i)\big) + H[Q_{-i}(Z_{-i})] + \ln C
  \end{aligned}$$

Derivation (cont.)
- Maximizing the functional $\mathcal{L}$ with respect to each $Q_i$ could be achieved by Lagrange multipliers and functional differentiation:
  $$\forall i.\ \frac{\partial}{\partial Q_i(Z_i)} \Big\{ -D_{KL}\big[Q_i(Z_i) \,\|\, Q_i^*(Z_i)\big] + \lambda_i \Big( \int Q_i(Z_i)\, dZ_i - 1 \Big) \Big\} := 0$$
- A long algebraic derivation would eventually lead to a Gibbs distribution; fortunately, $\mathcal{L}$ is maximized when the KL divergence is zero:
  $$Q_i^*(Z_i) = \frac{1}{C} \exp\big( \langle \ln P(Z_i, Z_{-i}, D) \rangle_{Q_{-i}(Z_{-i})} \big)$$
  where $C$ is a normalization constant.
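A minimal coordinate-ascent sketch of this update for two discrete latent variables (the joint table and iteration count are assumptions made up for illustration): each factor is set in turn to $q_i \propto \exp(E_{q_{-i}}[\ln P])$.

```python
import numpy as np

# Coordinate-ascent mean-field (CAVI) for two discrete latent variables z1, z2 in {0, 1}.
log_p = np.log(np.array([[0.20, 0.05],     # log P(z1, z2, D), a made-up table
                         [0.10, 0.15]]))

q1 = np.array([0.5, 0.5])                  # factorized approximation Q(z1, z2) = q1(z1) q2(z2)
q2 = np.array([0.5, 0.5])

for _ in range(50):
    # q1(z1) proportional to exp( E_{q2}[ log P(z1, z2, D) ] )
    q1 = np.exp(log_p @ q2); q1 /= q1.sum()
    # q2(z2) proportional to exp( E_{q1}[ log P(z1, z2, D) ] )
    q2 = np.exp(log_p.T @ q1); q2 /= q2.sum()

# Compare with the exact marginals of the true posterior P(z1, z2 | D).
post = np.exp(log_p); post /= post.sum()
print("q1 =", q1, " exact:", post.sum(axis=1))
print("q2 =", q2, " exact:", post.sum(axis=0))
```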

Challenge
$$Q_i^*(Z_i) = \frac{1}{C} \exp\big( \langle \ln P(Z_i, Z_{-i}, D) \rangle_{Q_{-i}(Z_{-i})} \big)$$
- The expectation can be intractable.
- We need to pick a family of distributions $\mathcal{Q}$ that allows for exact inference.
- Then find the $Q' \in \mathcal{Q}$ that maximizes the variational free energy.

Challenge
- The expectation can be intractable.
- We need to pick a family of distributions $\mathcal{Q}$ that allows for exact inference, then find the $Q' \in \mathcal{Q}$ that maximizes the variational free energy.
- Answer: the exponential family.

Why the Exponential Family?
- Principle of maximum entropy.
- Density function: $p_\theta(x) = \exp\{\langle \theta, \phi(x) \rangle - A(\theta)\}$
- Log partition function: $A(\theta) = \log \int \exp\{\langle \theta, \phi(x) \rangle\}\, \nu(dx)$
- $\theta$: canonical parameters; $\mu = E_\theta[\phi(X)]$: mean parameters.

Properties of the Exponential Family
- Mean parameters $\mu$ versus canonical parameters $\theta$: "various statistical computations, among them marginalization and maximum likelihood estimation, can be understood as transforming from one parameterization to the other."
- The set of all realizable mean parameters is always a convex subset of $\mathbb{R}^d$.
- Forward mapping: from the canonical parameters $\theta$ to the mean parameters $\mu = E_\theta[\phi(x)]$.
- Backward mapping: from the mean parameters $\mu$ to the canonical parameters $\theta$.
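A small sketch of the two mappings for the Bernoulli distribution, whose log partition function is $A(\theta) = \log(1 + e^{\theta})$: the forward map $\mu = \nabla A(\theta)$ is the sigmoid and the backward map is the logit (the example values are arbitrary).

```python
import numpy as np

# Bernoulli in exponential-family form: p(x) = exp(theta * x - A(theta)), x in {0, 1},
# with sufficient statistic phi(x) = x and log partition A(theta) = log(1 + exp(theta)).
def forward(theta):
    """Forward mapping: canonical -> mean parameter, mu = A'(theta) = E[phi(X)]."""
    return 1.0 / (1.0 + np.exp(-theta))          # sigmoid

def backward(mu):
    """Backward mapping: mean -> canonical parameter, theta = log(mu / (1 - mu))."""
    return np.log(mu) - np.log(1.0 - mu)         # logit, defined on the interior (0, 1)

theta = 0.7
mu = forward(theta)
print(mu, backward(mu))                          # recovers theta = 0.7
```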

Properties of the Partition Function A
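Following the Wainwright and Jordan (2008) monograph cited in the references, the two properties of $A$ relied on later are summarized below (added here as a reference, not quoted from the slide):

```latex
% Properties of the (log) partition function A, following Wainwright & Jordan (2008).
\begin{align}
  \nabla A(\theta) &= \mathbb{E}_{\theta}[\phi(X)] = \mu
      && \text{(derivatives of } A \text{ give the mean parameters)} \\
  \nabla^2 A(\theta) &= \operatorname{Cov}_{\theta}[\phi(X)] \succeq 0
      && \text{(so } A \text{ is convex in } \theta\text{)}
\end{align}
```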

Conjugate Duality: Maximum Likelihood and Maximum Entropy
- The variational representation of the log partition function.
- The conjugate dual function to $A$.
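The corresponding formulas, again following Wainwright and Jordan (2008):

```latex
% Conjugate duality for the log partition function (Wainwright & Jordan, 2008).
\begin{align}
  A^{*}(\mu) &= \sup_{\theta \in \Omega} \big\{ \langle \theta, \mu \rangle - A(\theta) \big\}
      = -H\!\big(p_{\theta(\mu)}\big), \quad \mu \in \mathcal{M}^{\circ}
      && \text{(negative entropy: the maximum-entropy view)} \\
  A(\theta) &= \sup_{\mu \in \mathcal{M}} \big\{ \langle \theta, \mu \rangle - A^{*}(\mu) \big\}
      && \text{(variational representation: the maximum-likelihood view)}
\end{align}
```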

Nonconvexity of Naïve Mean Field
- Mean field optimization is always nonconvex for any exponential family in which the state space is finite.
- The mean-field feasible set is a strict subset of $\mathcal{M}(G)$, yet it contains all of the extreme points of the polytope.

Outline (recap): Bayesian Networks

Inference in Bayesian Networks
- Variational Message Passing (Winn and Bishop, 2005).
- Message from a parent $Y$ to a child $X$: $m_{Y \to X} = \langle u_Y \rangle$
- Message from a child $X$ to a parent $Y$: $m_{X \to Y} = \tilde{\phi}_{XY}\big(\langle u_X \rangle, \{m_{i \to X}\}_{i \in \mathrm{cp}}\big)$
- Update of the natural parameter vector:
  $$\phi_Y^* = \tilde{\phi}_Y\big(\{m_{i \to Y}\}_{i \in \mathrm{pa}_Y}\big) + \sum_{j \in \mathrm{ch}_Y} m_{j \to Y}$$
- Conjugate-exponential form: $P(X \mid \phi) = \exp\big[\phi^T u(X) + f(X) + g(\phi)\big]$
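The mechanics are easiest to see in the simplest conjugate case. The sketch below is only an illustration of the natural-parameter bookkeeping (the model, data, and prior values are assumptions): inferring the mean of a Gaussian with known precision, where each observed child sends a message in natural-parameter form and the parent's posterior natural parameters are the prior's plus the sum of the incoming messages.

```python
import numpy as np

# Minimal illustration of a VMP-style natural-parameter update for one node:
# unknown mean mu ~ N(m0, 1/lambda0), observations x_n ~ N(mu, 1/lam) with lam known.
# A Gaussian over mu has natural parameters (lambda * m, -lambda / 2) for statistics (mu, mu^2).
m0, lambda0 = 0.0, 0.1                     # prior (assumed values for illustration)
lam = 2.0                                  # known observation precision
x = np.array([1.2, 0.8, 1.5, 1.1])         # observed children (made-up data)

prior_nat = np.array([lambda0 * m0, -lambda0 / 2.0])

# Message from each observed child x_n to the parent mu: (lam * x_n, -lam / 2).
messages = [np.array([lam * xn, -lam / 2.0]) for xn in x]

# Update: posterior natural parameters = prior + sum of child-to-parent messages.
post_nat = prior_nat + np.sum(messages, axis=0)
post_precision = -2.0 * post_nat[1]
post_mean = post_nat[0] / post_precision
print(post_mean, post_precision)           # matches the standard conjugate Gaussian update
```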

Summary of VB
- The mean-field assumption plus variational methods give the update
  $$Q(Z_i) = \frac{1}{C} \exp\big( \langle \ln P(Z_i, Z_{-i}, D) \rangle \big)$$
- With a conjugate-exponential model, the forward and backward mappings make this expectation tractable, so the update for each factor depends only on $Q(Z_i)$ and $Q(\mathrm{mb}(Z_i))$, the distributions over the Markov blanket of $Z_i$.

Outline (recap): Example (VB for Mixture Models)

Mixture of Gaussians (MoG)
$$p(X \mid Z, \mu, \Lambda) = \prod_{n=1}^{N} \prod_{k=1}^{K} \mathcal{N}\big(x_n \mid \mu_k, \Lambda_k^{-1}\big)^{z_{nk}}$$
$$p(X, Z, \pi, \mu, \Lambda) = p(X \mid Z, \mu, \Lambda)\, p(Z \mid \pi)\, p(\pi)\, p(\mu \mid \Lambda)\, p(\Lambda)$$
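A minimal sketch of this generative process (all parameter values are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Generative MoG: draw a one-hot indicator z_n, then sample x_n from the chosen component.
pi = np.array([0.5, 0.3, 0.2])                                   # mixing proportions
mu = np.array([[-2.0, 0.0], [2.0, 2.0], [0.0, -3.0]])            # component means (K x D)
Lambda = np.array([np.eye(2), 2 * np.eye(2), 0.5 * np.eye(2)])   # precisions (K x D x D)

N = 500
Z = rng.multinomial(1, pi, size=N)                               # one-hot indicators, shape (N, K)
k = Z.argmax(axis=1)
X = np.array([rng.multivariate_normal(mu[j], np.linalg.inv(Lambda[j])) for j in k])
print(X.shape, Z.sum(axis=0))                                    # data and per-component counts
```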

Infinite Student's t-Mixture
- Dirichlet process, stick-breaking prior:
  $$\pi_j(V) = V_j \prod_{i=1}^{j-1} (1 - V_i), \qquad V_j \sim \mathrm{Beta}(1, \alpha), \qquad p(\alpha) = \mathrm{Gam}(\alpha \mid \eta_1, \eta_2)$$
  $$G = \sum_{j=1}^{\infty} \pi_j(V)\, \delta_{\Theta_j} \sim \mathrm{DP}(\alpha, G_0)$$
- Dirichlet process mixture:
  $$p(X) = \prod_{n=1}^{N} \sum_{j=1}^{\infty} \pi_j(V)\, \mathrm{St}\big(x_n \mid \mu_j, \Lambda_j, v_j\big)$$
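A small sketch of truncated stick-breaking sampling of the DP mixture weights (the truncation level and concentration are assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

# Truncated stick-breaking sample of Dirichlet process mixture weights.
alpha, T = 1.0, 20                         # concentration and truncation level
V = rng.beta(1.0, alpha, size=T)           # V_j ~ Beta(1, alpha)
remaining = np.concatenate(([1.0], np.cumprod(1.0 - V)[:-1]))
pi = V * remaining                         # pi_j = V_j * prod_{i<j} (1 - V_i)
print(pi[:5], pi.sum())                    # weights decay; the sum approaches 1 as T grows
```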

Latent Dirichlet Allocation (LDA)

Outline (recap): Example (VB for Mixture Models, continued)

Step 1: Choose (Non-informative) Conjugate Prior Distributions
- Priors are chosen by general principles such as the maximum entropy principle; conjugate priors are used so that the posterior $h(\theta \mid x)$ belongs to the same family as the prior.
- Model:
  $$\pi \sim \mathrm{SymDir}(K, \alpha_0)$$
  $$\Lambda_{k=1,\dots,K} \sim \mathcal{W}(w_0, \nu_0)$$
  $$\mu_{k=1,\dots,K} \sim \mathcal{N}\big(m_0, (\beta_0 \Lambda_k)^{-1}\big)$$
  $$z_{n=1,\dots,N} \sim \mathrm{Mult}(1, \pi)$$
  $$x_{n=1,\dots,N} \sim \mathcal{N}\big(\mu_{z_n}, \Lambda_{z_n}^{-1}\big)$$
- Notation: $K$ is the number of Gaussian components and $N$ the number of samples. $\mathrm{SymDir}(\cdot)$ is the $K$-dimensional symmetric Dirichlet distribution, the conjugate prior of the categorical/multinomial distribution. $\mathcal{W}(\cdot)$ is the Wishart distribution, the conjugate prior for the precision of a Gaussian. $\mathrm{Mult}(\cdot)$ denotes the multinomial distribution; here each $z_n$ is a 1-of-$K$ vector with exactly one entry equal to 1 and all others 0. $\mathcal{N}(\cdot)$ is the multivariate Gaussian.
- $X = \{x_1, \dots, x_N\}$ are the $N$ observations; $Z = \{z_1, \dots, z_N\}$ are latent variables, where $z_n = (z_{n1}, \dots, z_{nK})$ indicates which mixture component the sample $x_n$ belongs to; $\pi = \{\pi_1, \dots, \pi_K\}$ are the mixing proportions; $\mu_k$ and $\Lambda_k$ are the mean and precision of component $k$; the hyperparameters are $K, \alpha_0, \beta_0, w_0, \nu_0, m_0$.

Plate Notation
- The model is shown in plate notation. Small squares denote fixed hyperparameters such as $\beta_0, \nu_0, \alpha_0, m_0, w_0$; circles denote random variables such as $\pi, z_n, x_n, \mu_k, \Lambda_k$.
- The indicator $z_n$ selects which component parameters $(\mu_k, \Lambda_k)$ generate the observation $x_n$.

Step 2: Write Down the Joint Probability Density
- The joint density factorizes as
  $$p(X, Z, \pi, \mu, \Lambda) = p(X \mid Z, \mu, \Lambda)\, p(Z \mid \pi)\, p(\pi)\, p(\mu \mid \Lambda)\, p(\Lambda)$$
- Each factor is:
  $$p(X \mid Z, \mu, \Lambda) = \prod_{n=1}^{N} \prod_{k=1}^{K} \mathcal{N}\big(x_n \mid \mu_k, \Lambda_k^{-1}\big)^{z_{nk}}$$
  $$p(Z \mid \pi) = \prod_{n=1}^{N} \prod_{k=1}^{K} \pi_k^{z_{nk}}$$
  $$p(\pi) = \frac{\Gamma(K \alpha_0)}{\Gamma(\alpha_0)^K} \prod_{k=1}^{K} \pi_k^{\alpha_0 - 1}$$
  $$p(\mu \mid \Lambda) = \prod_{k=1}^{K} \mathcal{N}\big(\mu_k \mid m_0, (\beta_0 \Lambda_k)^{-1}\big)$$
  $$p(\Lambda) = \prod_{k=1}^{K} \mathcal{W}(\Lambda_k \mid w_0, \nu_0)$$
- where
  $$\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\Big( -\tfrac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \Big)$$
  $$\mathcal{W}(\Lambda \mid w, \nu) = B(w, \nu)\, |\Lambda|^{(\nu - D - 1)/2} \exp\Big( -\tfrac{1}{2} \mathrm{Tr}(w^{-1} \Lambda) \Big)$$
  $$B(w, \nu) = |w|^{-\nu/2} \Big( 2^{\nu D / 2}\, \pi^{D(D-1)/4} \prod_{i=1}^{D} \Gamma\big(\tfrac{\nu + 1 - i}{2}\big) \Big)^{-1}$$

Step 3: Compute the Variational Marginal Densities (VB marginals)
- Assume the factorization $q(Z, \pi, \mu, \Lambda) = q(Z)\, q(\pi, \mu, \Lambda)$.
- (1) For $q(Z)$:
  $$\begin{aligned}
  \ln q^*(Z) &= E_{\pi, \mu, \Lambda}[\ln p(X, Z, \pi, \mu, \Lambda)] + \text{const} \\
  &= E_{\pi, \mu, \Lambda}[\ln p(X \mid Z, \mu, \Lambda)\, p(Z \mid \pi)\, p(\pi)\, p(\mu \mid \Lambda)\, p(\Lambda)] + \text{const} \\
  &= E_{\pi}[\ln p(Z \mid \pi)] + E_{\mu, \Lambda}[\ln p(X \mid Z, \mu, \Lambda)] + \text{const} \\
  &= \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \ln \rho_{nk} + \text{const}
  \end{aligned}$$
  where
  $$\ln \rho_{nk} = E[\ln \pi_k] + \tfrac{1}{2} E[\ln |\Lambda_k|] - \tfrac{D}{2} \ln(2\pi) - \tfrac{1}{2} E_{\mu_k, \Lambda_k}\big[(x_n - \mu_k)^T \Lambda_k (x_n - \mu_k)\big]$$
- Exponentiating both sides gives $q^*(Z) \propto \prod_{n=1}^{N} \prod_{k=1}^{K} \rho_{nk}^{z_{nk}}$; normalizing gives
  $$q^*(Z) = \prod_{n=1}^{N} \prod_{k=1}^{K} r_{nk}^{z_{nk}}, \qquad r_{nk} = \frac{\rho_{nk}}{\sum_{j=1}^{K} \rho_{nj}}$$
- So $q^*(Z)$ is a product of multinomial (categorical) distributions, and therefore $E[z_{nk}] = r_{nk}$.

- (2) For $q(\pi)$, with $q(\pi, \mu, \Lambda) = q(\pi) \prod_{k=1}^{K} q(\mu_k, \Lambda_k)$:
  $$\begin{aligned}
  \ln q^*(\pi) &= E_{Z, \mu, \Lambda}[\ln p(X, Z, \pi, \mu, \Lambda)] + \text{const} \\
  &= \ln p(\pi) + E_Z[\ln p(Z \mid \pi)] + \text{const} \\
  &= (\alpha_0 - 1) \sum_{k=1}^{K} \ln \pi_k + \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \ln \pi_k + \text{const}
  \end{aligned}$$
- Exponentiating both sides gives $q^*(\pi) \propto \prod_{k=1}^{K} \pi_k^{\alpha_0 - 1 + \sum_n r_{nk}}$, so $q^*(\pi)$ is a Dirichlet distribution:
  $$q^*(\pi) = \mathrm{Dir}(\alpha), \qquad \alpha_k = \alpha_0 + N_k, \qquad N_k = \sum_{n=1}^{N} r_{nk}$$

- (3) Finally, consider $\mu$ and $\Lambda$ together; for each Gaussian component,
  $$\begin{aligned}
  \ln q^*(\mu_k, \Lambda_k) &= E_{Z, \pi, \mu_{i \neq k}, \Lambda_{i \neq k}}\big[\ln p(X \mid Z, \mu_k, \Lambda_k)\, p(\mu_k, \Lambda_k)\big] + \text{const} \\
  &= \ln p(\mu_k, \Lambda_k) + \sum_{n=1}^{N} E[z_{nk}] \ln \mathcal{N}\big(x_n \mid \mu_k, \Lambda_k^{-1}\big) + \text{const}
  \end{aligned}$$
- The result is a Gaussian-Wishart distribution:
  $$q^*(\mu_k, \Lambda_k) = \mathcal{N}\big(\mu_k \mid m_k, (\beta_k \Lambda_k)^{-1}\big)\, \mathcal{W}(\Lambda_k \mid w_k, \nu_k)$$
  where
  $$\beta_k = \beta_0 + N_k, \qquad m_k = \frac{1}{\beta_k}\big(\beta_0 m_0 + N_k \bar{x}_k\big), \qquad \nu_k = \nu_0 + N_k$$
  $$w_k^{-1} = w_0^{-1} + N_k S_k + \frac{\beta_0 N_k}{\beta_0 + N_k} (\bar{x}_k - m_0)(\bar{x}_k - m_0)^T$$
  $$\bar{x}_k = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk}\, x_n, \qquad S_k = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk}\, (x_n - \bar{x}_k)(x_n - \bar{x}_k)^T$$

Step 4: Iterate Until Convergence
- Computing the responsibilities requires $\rho_{nk}$, which in turn requires the three expectations $E[\ln \pi_k]$, $E[\ln |\Lambda_k|]$, and $E_{\mu_k, \Lambda_k}[(x_n - \mu_k)^T \Lambda_k (x_n - \mu_k)]$. Their general expressions are:
  $$\ln \tilde{\pi}_k \equiv E[\ln \pi_k] = \psi(\alpha_k) - \psi\Big(\sum_{i=1}^{K} \alpha_i\Big)$$
  $$\ln \tilde{\Lambda}_k \equiv E[\ln |\Lambda_k|] = \sum_{i=1}^{D} \psi\Big(\frac{\nu_k + 1 - i}{2}\Big) + D \ln 2 + \ln |w_k|$$
  $$E_{\mu_k, \Lambda_k}\big[(x_n - \mu_k)^T \Lambda_k (x_n - \mu_k)\big] = \frac{D}{\beta_k} + \nu_k (x_n - m_k)^T w_k (x_n - m_k)$$
- These results lead to
  $$r_{nk} \propto \tilde{\pi}_k\, \tilde{\Lambda}_k^{1/2} \exp\Big( -\frac{D}{2\beta_k} - \frac{\nu_k}{2} (x_n - m_k)^T w_k (x_n - m_k) \Big)$$
  with $\sum_{k=1}^{K} r_{nk} = 1$.
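In code, the normalization of $\rho_{nk}$ into responsibilities $r_{nk}$ is usually done in log space for numerical stability; a minimal sketch with made-up values:

```python
import numpy as np
from scipy.special import logsumexp

# Normalize log rho_{nk} into responsibilities r_{nk} row by row (log-sum-exp for stability).
log_rho = np.array([[-1.0, -2.5, -0.3],            # made-up values for N=2 samples, K=3
                    [-0.2, -4.0, -1.1]])
log_r = log_rho - logsumexp(log_rho, axis=1, keepdims=True)
r = np.exp(log_r)
print(r, r.sum(axis=1))                            # each row sums to 1
```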

Summary: Variational Inference for a GMM
- VBM step (update the variational posteriors from the soft counts / expected sufficient statistics):
  $$q^*(\pi) = \mathrm{Dir}(\alpha), \qquad \alpha_k = \alpha_0 + N_k, \qquad N_k = \sum_{n=1}^{N} r_{nk}$$
  $$q^*(\mu_k, \Lambda_k) = \mathcal{N}\big(\mu_k \mid m_k, (\beta_k \Lambda_k)^{-1}\big)\, \mathcal{W}(\Lambda_k \mid w_k, \nu_k)$$
  $$\beta_k = \beta_0 + N_k, \qquad m_k = \frac{1}{\beta_k}(\beta_0 m_0 + N_k \bar{x}_k), \qquad \nu_k = \nu_0 + N_k, \qquad \bar{x}_k = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk} x_n$$
  $$w_k^{-1} = w_0^{-1} + N_k S_k + \frac{\beta_0 N_k}{\beta_0 + N_k} (\bar{x}_k - m_0)(\bar{x}_k - m_0)^T, \qquad S_k = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk} (x_n - \bar{x}_k)(x_n - \bar{x}_k)^T$$
- VBE step (update the responsibilities of the latent variables):
  $$q^*(Z) = \prod_{n=1}^{N} \prod_{k=1}^{K} r_{nk}^{z_{nk}}, \qquad r_{nk} \propto \tilde{\pi}_k\, \tilde{\Lambda}_k^{1/2} \exp\Big( -\frac{D}{2\beta_k} - \frac{\nu_k}{2} (x_n - m_k)^T w_k (x_n - m_k) \Big)$$
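Putting the two steps together, here is a compact numpy sketch of the VBEM loop above (the synthetic data, hyperparameter values, and initialization are assumptions; this is an illustration, not the author's implementation):

```python
import numpy as np
from scipy.special import digamma
from numpy.linalg import slogdet, inv

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-3, 1, (150, 2)), rng.normal(3, 1, (150, 2))])   # toy data
N, D = X.shape
K = 4
alpha0, beta0, nu0 = 1e-3, 1.0, float(D)       # assumed hyperparameters
m0, w0 = np.zeros(D), np.eye(D)

r = rng.dirichlet(np.ones(K), size=N)          # random initial responsibilities

for it in range(100):
    # ---- VBM step: update q(pi) and q(mu_k, Lambda_k) from the soft counts ----
    Nk = r.sum(axis=0) + 1e-10
    xbar = (r.T @ X) / Nk[:, None]
    alpha = alpha0 + Nk
    beta = beta0 + Nk
    nu = nu0 + Nk
    m = (beta0 * m0 + Nk[:, None] * xbar) / beta[:, None]
    W = np.zeros((K, D, D))
    for k in range(K):
        diff = X - xbar[k]
        Sk = (r[:, k, None] * diff).T @ diff / Nk[k]
        dm = (xbar[k] - m0)[:, None]
        W[k] = inv(inv(w0) + Nk[k] * Sk + (beta0 * Nk[k] / (beta0 + Nk[k])) * (dm @ dm.T))

    # ---- VBE step: recompute responsibilities from the expected log terms ----
    ln_pi = digamma(alpha) - digamma(alpha.sum())
    ln_lam = np.array([digamma((nu[k] + 1 - np.arange(1, D + 1)) / 2).sum()
                       + D * np.log(2) + slogdet(W[k])[1] for k in range(K)])
    log_rho = np.zeros((N, K))
    for k in range(K):
        diff = X - m[k]
        quad = D / beta[k] + nu[k] * np.einsum('nd,de,ne->n', diff, W[k], diff)
        log_rho[:, k] = ln_pi[k] + 0.5 * ln_lam[k] - 0.5 * D * np.log(2 * np.pi) - 0.5 * quad
    log_r = log_rho - log_rho.max(axis=1, keepdims=True)
    r = np.exp(log_r)
    r /= r.sum(axis=1, keepdims=True)

print("expected mixing proportions:", alpha / alpha.sum())
print("posterior means m_k:", m)
```

With a small $\alpha_0$, components that receive no responsibility are effectively pruned, which is one way the VB treatment controls model complexity.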

Outline (recap): Discussion, Applications, References

EM vs. VB

The Accuracy-vs-Complexity Trade-off

Applications
- Matrix factorization: probabilistic PCA, mixtures of PPCA, independent factor analysis (IFA), nonlinear ICA/IFA/SSM, mixtures of Bayesian ICA, Bayesian mixtures of factor analyzers, etc.
- Time series: Bayesian HMMs, variational Kalman filtering, switching state-space models, etc.
- Topic models: latent Dirichlet allocation (LDA), (hierarchical) Dirichlet process (mixture) models, Bayesian nonparametric models, etc.
- Variational Gaussian process classifiers.
- Sparse Bayesian learning.
- Variational Bayesian filtering, etc.

References
- Neal, Radford M., and Geoffrey E. Hinton. "A view of the EM algorithm that justifies incremental, sparse, and other variants." In Learning in Graphical Models, 355-368. Springer Netherlands, 1998.
- Attias, Hagai. "Inferring parameters and structure of latent variable models by variational Bayes." In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann Publishers Inc., 1999.
- Winn, John, and Christopher M. Bishop. "Variational message passing." Journal of Machine Learning Research 6 (2005).
- Wainwright, Martin J., and Michael I. Jordan. "Graphical models, exponential families, and variational inference." Foundations and Trends in Machine Learning 1(1-2) (2008): 1-305.
- Šmídl, Václav, and Anthony Quinn. The Variational Bayes Method in Signal Processing. Springer, 2006.
- Wikipedia, "Variational Bayesian methods", http://en.wikipedia.org/wiki/Variational_Bayes

Any Questions?