EM Proof

The proof relies on the non-negativity of the KL divergence.

Given parameters $\theta$ (an arbitrary parameter) and $\theta'$ (the previous guess), together with the observations $y_{1:T}$, we can define two posterior distributions over the hidden state sequence $s_{1:T}$:

$$
\begin{align*}
P_1(s_{1:T}) &= \frac{f(s_{1:T}, y_{1:T}|\theta')}{f(y_{1:T}|\theta')} \\
P_2(s_{1:T}) &= \frac{f(s_{1:T}, y_{1:T}|\theta)}{f(y_{1:T}|\theta)}
\end{align*}
$$

Now recall the expected complete-data log-likelihood and apply the definition of the posterior sequence probability above:

Q(θ,θ)Es1:Ty1:T,θ[logf(y1:T,s1:Tθ)]=s1:TSTP1(s1:T)logf(y1:T,s1:Tθ)=1f(y1:Tθ)s1:TSTf(s1:T,y1:Tθ)logf(y1:T,s1:Tθ)\begin{align*} Q(\theta, \theta') & \equiv E_{s_{1:T}|y_{1:T}, \theta'}\left[ \log f(y_{1:T}, s_{1:T}|\theta)\right] \\ & = \sum_{s_{1:T}\in S^T} P_1(s_{1:T}) * \log f(y_{1:T}, s_{1:T}|\theta)\\ & = \frac{1}{f(y_{1:T}|\theta')} * \sum_{s_{1:T}\in S^T} f(s_{1:T}, y_{1:T}|\theta') * \log f(y_{1:T}, s_{1:T}|\theta)\\ \end{align*}

So the KL divergence between the two posterior distributions is

DKL(P1P2)=s1:TSTf(s1:T,y1:Tθ)f(y1:Tθ)log(f(s1:T,y1:Tθ)f(y1:Tθ)f(s1:T,y1:Tθ)f(y1:Tθ))=logf(y1:Tθ)f(y1:Tθ)+s1:TSTf(s1:T,y1:Tθ)f(y1:Tθ)log(f(s1:T,y1:Tθ)f(s1:T,y1:Tθ))=logf(y1:Tθ)f(y1:Tθ)+Q(θ,θ)Q(θ,θ)\begin{align*} D_{KL}(P_1||P_2) & = \sum_{s_{1:T}\in S^T} \frac{f(s_{1:T},y_{1:T}|\theta')}{f(y_{1:T}|\theta')} \log \left( \frac{f(s_{1:T},y_{1:T}|\theta')f(y_{1:T}|\theta)}{f(s_{1:T},y_{1:T}|\theta)f(y_{1:T}|\theta')} \right) \\ & = \log \frac{f(y_{1:T}|\theta)}{f(y_{1:T}|\theta')} + \sum_{s_{1:T}\in S^T} \frac{f(s_{1:T},y_{1:T}|\theta')}{f(y_{1:T}|\theta')} \log \left( \frac{f(s_{1:T},y_{1:T}|\theta')}{f(s_{1:T},y_{1:T}|\theta)} \right) \\ & = \log \frac{f(y_{1:T}|\theta)}{f(y_{1:T}|\theta')} + Q(\theta', \theta') - Q(\theta, \theta') \\ \end{align*}

Since the KL divergence is non-negative (by Jensen's inequality), we have

$$
\log f(y_{1:T}|\theta) - \log f(y_{1:T}|\theta') \geq Q(\theta, \theta') - Q(\theta', \theta')
$$
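The bound above can be checked numerically by brute force. The sketch below uses a tiny two-state Bernoulli HMM with hypothetical parameter values chosen only for illustration; since $T$ is small, we can enumerate all $2^T$ state sequences, compute $Q$ and the marginal likelihood exactly, and verify the inequality for an arbitrary pair $(\theta, \theta')$:

```python
import itertools
import math

# Hypothetical two-state Bernoulli HMM: theta = (pi, A, b), where pi is the
# initial distribution, A the transition matrix, and b[k] = P(y_t = 1 | s_t = k).
y = [1, 0, 1, 1]                                                    # observations, T = 4
theta_prev = ([0.6, 0.4], [[0.7, 0.3], [0.2, 0.8]], [0.9, 0.3])     # theta' (previous guess)
theta_new  = ([0.5, 0.5], [[0.6, 0.4], [0.4, 0.6]], [0.8, 0.2])     # theta  (arbitrary)

def complete_lik(theta, s, y):
    """f(s_{1:T}, y_{1:T} | theta): joint probability of one state sequence and y."""
    pi, A, b = theta
    p = pi[s[0]] * (b[s[0]] if y[0] == 1 else 1 - b[s[0]])
    for t in range(1, len(y)):
        p *= A[s[t - 1]][s[t]] * (b[s[t]] if y[t] == 1 else 1 - b[s[t]])
    return p

def marginal_lik(theta, y):
    """f(y_{1:T} | theta): sum the joint over all 2^T state sequences."""
    return sum(complete_lik(theta, s, y)
               for s in itertools.product([0, 1], repeat=len(y)))

def Q(theta, theta_prev, y):
    """Expected complete-data log-likelihood under the posterior at theta'."""
    fy = marginal_lik(theta_prev, y)
    return sum(complete_lik(theta_prev, s, y) / fy * math.log(complete_lik(theta, s, y))
               for s in itertools.product([0, 1], repeat=len(y)))

lhs = math.log(marginal_lik(theta_new, y)) - math.log(marginal_lik(theta_prev, y))
rhs = Q(theta_new, theta_prev, y) - Q(theta_prev, theta_prev, y)
assert lhs >= rhs   # lhs - rhs equals D_KL(P1 || P2) >= 0
```

Note that the gap `lhs - rhs` is exactly the KL divergence from the derivation, so the inequality holds for any choice of `theta_new`, not just an improving one.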

This implies that if the parameter $\theta$ is chosen such that $Q(\theta, \theta') \geq Q(\theta', \theta')$, the log likelihood of the data will not decrease. In the maximization step, $\theta$ is chosen by maximizing $Q(\theta, \theta')$, so the inequality is satisfied by definition, which guarantees that the log likelihood is non-decreasing (and hence converges to a local maximum).

Note that, technically, the maximization step doesn't need to pick the $\theta$ that maximizes $Q(\theta, \theta')$. As long as the new $\theta$ satisfies $Q(\theta, \theta') \geq Q(\theta', \theta')$, the log likelihood is still guaranteed to be non-decreasing. This variant is called the generalized EM algorithm.
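This can be illustrated on a simpler latent-variable model where the full maximizer is available in closed form. The sketch below (hypothetical data, a two-component Bernoulli mixture rather than an HMM) takes only a half-step toward the M-step maximizer at each iteration; because $Q$ is concave in the parameters, the partial step still satisfies $Q(\theta, \theta') \geq Q(\theta', \theta')$, and the observed-data log likelihood never decreases:

```python
import math

# Hypothetical coin flips drawn from a mixture of two Bernoulli components.
x = [1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1]

def loglik(w, p1, p2, x):
    """Observed-data log-likelihood of the two-component Bernoulli mixture."""
    return sum(math.log(w * (p1 if xi else 1 - p1)
                        + (1 - w) * (p2 if xi else 1 - p2)) for xi in x)

w, p1, p2 = 0.5, 0.6, 0.4            # initial guess theta'
prev_ll = loglik(w, p1, p2, x)
for _ in range(50):
    # E-step: responsibility of component 1 for each observation.
    r = [w * (p1 if xi else 1 - p1)
         / (w * (p1 if xi else 1 - p1) + (1 - w) * (p2 if xi else 1 - p2))
         for xi in x]
    # Closed-form maximizer of Q(theta, theta') for this model...
    w_star = sum(r) / len(x)
    p1_star = sum(ri * xi for ri, xi in zip(r, x)) / sum(r)
    p2_star = sum((1 - ri) * xi for ri, xi in zip(r, x)) / (len(x) - sum(r))
    # ...but generalized EM only moves partway toward it: Q is concave in
    # (w, p1, p2), so the half-step still improves Q over Q(theta', theta').
    alpha = 0.5
    w  += alpha * (w_star - w)
    p1 += alpha * (p1_star - p1)
    p2 += alpha * (p2_star - p2)
    ll = loglik(w, p1, p2, x)
    assert ll >= prev_ll - 1e-12     # monotone, as the proof guarantees
    prev_ll = ll
```

The same monotonicity argument applies: each partial step improves $Q$, and the inequality from the proof converts that into a non-decreasing log likelihood.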
