The proof relies on the KL divergence.
Given parameters $\theta$ (an arbitrary parameter) and $\theta'$ (the previous guess), and observations $y_{1:T}$, we can define two posterior distributions over the state sequence $s_{1:T}$ (the hidden states underlying the observation sequence):
$$
P_1(s_{1:T}) = \frac{f(s_{1:T}, y_{1:T} \mid \theta')}{f(y_{1:T} \mid \theta')},
\qquad
P_2(s_{1:T}) = \frac{f(s_{1:T}, y_{1:T} \mid \theta)}{f(y_{1:T} \mid \theta)}
$$

Now recall the expected complete-data log-likelihood and apply the definition of the posterior sequence probability above:
$$
\begin{aligned}
Q(\theta, \theta') &\equiv \mathbb{E}_{s_{1:T} \mid y_{1:T}, \theta'}\left[\log f(y_{1:T}, s_{1:T} \mid \theta)\right] \\
&= \sum_{s_{1:T} \in S^T} P_1(s_{1:T}) \log f(y_{1:T}, s_{1:T} \mid \theta) \\
&= \frac{1}{f(y_{1:T} \mid \theta')} \sum_{s_{1:T} \in S^T} f(s_{1:T}, y_{1:T} \mid \theta') \log f(y_{1:T}, s_{1:T} \mid \theta)
\end{aligned}
$$
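To make $Q(\theta, \theta')$ concrete, here is a minimal brute-force sketch for a small discrete HMM, enumerating every state sequence in $S^T$. The parameterization $\theta = (\pi, A, B)$ (initial distribution, transition matrix, emission matrix) and the function names are assumptions made for this illustration, not part of the proof:

```python
import itertools
import numpy as np

def joint_log_prob(s, y, pi, A, B):
    """log f(s_{1:T}, y_{1:T} | theta) for a discrete HMM."""
    lp = np.log(pi[s[0]]) + np.log(B[s[0], y[0]])
    for t in range(1, len(y)):
        lp += np.log(A[s[t - 1], s[t]]) + np.log(B[s[t], y[t]])
    return lp

def Q(theta, theta_prime, y, n_states):
    """Q(theta, theta'): expected complete-data log-likelihood,
    computed by brute-force enumeration of all s_{1:T} in S^T."""
    seqs = list(itertools.product(range(n_states), repeat=len(y)))
    # joint log-probs under the previous guess theta'
    joint_prev = np.array([joint_log_prob(s, y, *theta_prime) for s in seqs])
    # log f(y_{1:T} | theta') via log-sum-exp over all state sequences
    log_evidence = np.logaddexp.reduce(joint_prev)
    # posterior P1(s_{1:T}) = f(s_{1:T}, y_{1:T} | theta') / f(y_{1:T} | theta')
    P1 = np.exp(joint_prev - log_evidence)
    return sum(p * joint_log_prob(s, y, *theta) for p, s in zip(P1, seqs))
```

This is exponential in $T$ and only meant to mirror the sum over $s_{1:T} \in S^T$ literally; a real implementation would use the forward-backward recursions instead.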
The KL divergence between the two posterior distributions is then

$$
\begin{aligned}
D_{\mathrm{KL}}(P_1 \,\|\, P_2)
&= \sum_{s_{1:T} \in S^T} \frac{f(s_{1:T}, y_{1:T} \mid \theta')}{f(y_{1:T} \mid \theta')}
\log\left( \frac{f(s_{1:T}, y_{1:T} \mid \theta')\, f(y_{1:T} \mid \theta)}{f(s_{1:T}, y_{1:T} \mid \theta)\, f(y_{1:T} \mid \theta')} \right) \\
&= \log \frac{f(y_{1:T} \mid \theta)}{f(y_{1:T} \mid \theta')}
+ \sum_{s_{1:T} \in S^T} \frac{f(s_{1:T}, y_{1:T} \mid \theta')}{f(y_{1:T} \mid \theta')}
\log\left( \frac{f(s_{1:T}, y_{1:T} \mid \theta')}{f(s_{1:T}, y_{1:T} \mid \theta)} \right) \\
&= \log \frac{f(y_{1:T} \mid \theta)}{f(y_{1:T} \mid \theta')} + Q(\theta', \theta') - Q(\theta, \theta')
\end{aligned}
$$

Since the KL divergence is always nonnegative, we have
$$
\log f(y_{1:T} \mid \theta) - \log f(y_{1:T} \mid \theta') \ge Q(\theta, \theta') - Q(\theta', \theta')
$$

This implies that if the arbitrary parameter $\theta$ is chosen such that $Q(\theta, \theta') \ge Q(\theta', \theta')$, the log-likelihood of the data cannot decrease. At the maximization step, $\theta$ is chosen by maximizing $Q(\theta, \theta')$, so the greater-than-or-equal criterion is satisfied by definition, which guarantees that the log-likelihood is monotonically non-decreasing (and hence converges, typically to a local maximum).
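The bound can be checked numerically. The sketch below reuses `joint_log_prob` and `Q` from above and draws two random parameter settings for a tiny two-state HMM (the helper names and Dirichlet initialization are again assumptions for illustration):

```python
def log_likelihood(theta, y, n_states):
    """log f(y_{1:T} | theta), again by brute-force enumeration."""
    seqs = itertools.product(range(n_states), repeat=len(y))
    return np.logaddexp.reduce(
        np.array([joint_log_prob(s, y, *theta) for s in seqs]))

rng = np.random.default_rng(0)

def random_theta(n_states, n_obs):
    """Random valid HMM parameters (each row drawn from a Dirichlet)."""
    pi = rng.dirichlet(np.ones(n_states))
    A = rng.dirichlet(np.ones(n_states), size=n_states)
    B = rng.dirichlet(np.ones(n_obs), size=n_states)
    return pi, A, B

y = [0, 1, 1, 0]                  # a toy observation sequence
theta_prime = random_theta(2, 2)  # "previous guess"
theta = random_theta(2, 2)        # arbitrary new parameter

lhs = log_likelihood(theta, y, 2) - log_likelihood(theta_prime, y, 2)
rhs = Q(theta, theta_prime, y, 2) - Q(theta_prime, theta_prime, y, 2)
assert lhs >= rhs - 1e-12         # the bound holds for *any* theta
```

The slack between `lhs` and `rhs` is exactly $D_{\mathrm{KL}}(P_1 \,\|\, P_2)$, which is why the assertion holds for any choice of $\theta$, not just the maximizer.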
Note that, technically, the maximization step does not need to pick the $\theta$ that maximizes $Q(\theta, \theta')$. As long as the new $\theta$ satisfies $Q(\theta, \theta') \ge Q(\theta', \theta')$, the log-likelihood is still guaranteed to be non-decreasing. This variant is called the generalized EM (GEM) algorithm.
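As a hypothetical sketch of such a partial step (building on the toy helpers above): instead of maximizing $Q$, accept the first candidate that does not decrease it. The random-blend search here is purely illustrative, not a standard GEM update rule.

```python
def gem_step(theta_prime, y, n_states, n_obs, n_trials=50, step=0.1):
    """One generalized-EM step: accept ANY theta with
    Q(theta, theta') >= Q(theta', theta'); no full maximization needed."""
    q0 = Q(theta_prime, theta_prime, y, n_states)
    for _ in range(n_trials):
        # Blend current parameters toward a random candidate; a convex
        # combination of stochastic matrices is still stochastic.
        rand = random_theta(n_states, n_obs)
        cand = tuple((1 - step) * p + step * r
                     for p, r in zip(theta_prime, rand))
        if Q(cand, theta_prime, y, n_states) >= q0:
            return cand    # Q did not decrease, so likelihood cannot drop
    return theta_prime     # no improving candidate found; keep old guess
```

Iterating `gem_step` therefore yields a non-decreasing `log_likelihood` by the inequality above, even though no maximization is ever performed.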