The proof relies on the non-negativity of the KL divergence.
Given parameters $\theta$ (an arbitrary candidate) and $\theta'$ (the previous guess), together with the observations $y_{1:T}$, we can define two posterior distributions over the hidden state sequence $s_{1:T}$:
\begin{align*}
P_1(s_{1:T}) &= \frac{f(s_{1:T}, y_{1:T}|\theta')}{f(y_{1:T}|\theta')} \\
P_2(s_{1:T}) &= \frac{f(s_{1:T}, y_{1:T}|\theta)}{f(y_{1:T}|\theta)}
\end{align*}
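To make the definitions concrete, here is a minimal sketch (not from the original text) that computes the posterior over state sequences by brute-force enumeration for a toy two-state HMM. The parameters `pi`, `A`, `B`, the observation sequence `y`, and the helper `joint` are all hypothetical choices for illustration.

```python
# A minimal sketch, assuming a hypothetical toy HMM (pi, A, B are made up).
import itertools
import numpy as np

pi = np.array([0.6, 0.4])               # initial distribution P(s_1)
A = np.array([[0.7, 0.3],               # A[i, j] = P(s_{t+1} = j | s_t = i)
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],               # B[i, k] = P(y_t = k | s_t = i)
              [0.2, 0.8]])
y = [0, 1, 0]                           # observed sequence y_{1:T}

def joint(s, y, pi, A, B):
    """f(s_{1:T}, y_{1:T} | theta): joint probability of a state sequence and the data."""
    p = pi[s[0]] * B[s[0], y[0]]
    for t in range(1, len(y)):
        p *= A[s[t - 1], s[t]] * B[s[t], y[t]]
    return p

# Enumerate all |S|^T state sequences to form the posterior P_1 (or P_2).
seqs = list(itertools.product(range(2), repeat=len(y)))
joints = np.array([joint(s, y, pi, A, B) for s in seqs])
evidence_val = joints.sum()             # f(y_{1:T} | theta), by total probability
posterior = joints / evidence_val       # one probability per state sequence
assert np.isclose(posterior.sum(), 1.0)
```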
Now recall the expected complete-data log-likelihood and apply the definition of the posterior sequence probability above:
\begin{align*}
Q(\theta, \theta')
& \equiv E_{s_{1:T}|y_{1:T}, \theta'}\left[ \log f(y_{1:T}, s_{1:T}|\theta)\right] \\
& = \sum_{s_{1:T}\in S^T} P_1(s_{1:T}) \log f(y_{1:T}, s_{1:T}|\theta)\\
& = \frac{1}{f(y_{1:T}|\theta')} \sum_{s_{1:T}\in S^T} f(s_{1:T}, y_{1:T}|\theta') \log f(y_{1:T}, s_{1:T}|\theta)
\end{align*}
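Continuing the toy sketch above (and reusing `joint`, `pi`, `A`, `B`, `y` from it), $Q(\theta, \theta')$ can be computed by the same brute-force enumeration; `evidence` and `Q` are hypothetical helper names, with each parameter set passed as a `(pi, A, B)` tuple.

```python
# Continuing the sketch above: Q(theta, theta') by direct enumeration.
def evidence(y, pi, A, B):
    """f(y_{1:T} | theta): sum of the joint over all state sequences."""
    return sum(joint(s, y, pi, A, B)
               for s in itertools.product(range(len(pi)), repeat=len(y)))

def Q(theta, theta_old, y):
    """Expected complete-data log-likelihood E_{s|y,theta'}[log f(y, s | theta)]."""
    Z_old = evidence(y, *theta_old)
    total = 0.0
    for s in itertools.product(range(len(theta[0])), repeat=len(y)):
        w = joint(s, y, *theta_old) / Z_old   # posterior weight P_1(s_{1:T})
        total += w * np.log(joint(s, y, *theta))
    return total
```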
So the KL divergence between the two posterior distributions is
\begin{align*}
D_{KL}(P_1||P_2)
& = \sum_{s_{1:T}\in S^T} \frac{f(s_{1:T},y_{1:T}|\theta')}{f(y_{1:T}|\theta')}
\log \left( \frac{f(s_{1:T},y_{1:T}|\theta')\,f(y_{1:T}|\theta)}{f(s_{1:T},y_{1:T}|\theta)\,f(y_{1:T}|\theta')} \right) \\
& = \log \frac{f(y_{1:T}|\theta)}{f(y_{1:T}|\theta')} + \sum_{s_{1:T}\in S^T} \frac{f(s_{1:T},y_{1:T}|\theta')}{f(y_{1:T}|\theta')}
\log \left( \frac{f(s_{1:T},y_{1:T}|\theta')}{f(s_{1:T},y_{1:T}|\theta)} \right) \\
& = \log \frac{f(y_{1:T}|\theta)}{f(y_{1:T}|\theta')} + Q(\theta', \theta') - Q(\theta, \theta')
\end{align*}
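The identity just derived can be checked numerically, continuing the same sketch; `theta` below is an arbitrary hypothetical perturbation of the parameters, playing the role of the candidate.

```python
# Continuing the sketch: verify
# D_KL(P1 || P2) = log f(y|theta)/f(y|theta') + Q(theta', theta') - Q(theta, theta').
theta_old = (pi, A, B)                   # theta': the previous guess
theta = (np.array([0.5, 0.5]),           # theta: an arbitrary candidate (made up)
         np.array([[0.6, 0.4], [0.3, 0.7]]),
         np.array([[0.8, 0.2], [0.3, 0.7]]))

Z_old, Z_new = evidence(y, *theta_old), evidence(y, *theta)
kl = sum((joint(s, y, *theta_old) / Z_old)
         * np.log((joint(s, y, *theta_old) / Z_old) / (joint(s, y, *theta) / Z_new))
         for s in itertools.product(range(2), repeat=len(y)))
rhs = np.log(Z_new / Z_old) + Q(theta_old, theta_old, y) - Q(theta, theta_old, y)
assert np.isclose(kl, rhs) and kl >= -1e-12   # KL matches the identity and is >= 0
```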
Since the KL divergence is non-negative, we have
\begin{align*}
\log f(y_{1:T}|\theta) - \log f(y_{1:T}|\theta') \geq Q(\theta, \theta') - Q(\theta', \theta')
\end{align*}
This implies that if the parameter $\theta$ is chosen such that $Q(\theta, \theta') \geq Q(\theta', \theta')$, the log-likelihood of the data does not decrease. At the maximization step, $\theta$ is chosen to maximize $Q(\theta, \theta')$, so the inequality is satisfied by construction. This guarantees the log-likelihood is non-decreasing, and the iterates converge toward a local maximum (more precisely, a stationary point) of the likelihood.
Note that, technically, the maximization step does not need to pick the $\theta$ that maximizes $Q(\theta, \theta')$. As long as the new $\theta$ satisfies $Q(\theta, \theta') \geq Q(\theta', \theta')$, the log-likelihood is still guaranteed to be non-decreasing. This variant is called the generalized EM algorithm; a numerical check of its acceptance criterion is sketched below.
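As a last check, continuing the sketch: whenever the candidate improves $Q$, the bound above guarantees the log-likelihood improves by at least as much, which is exactly the generalized-EM acceptance criterion.

```python
# Continuing the sketch: the generalized EM criterion in action.
gain_Q = Q(theta, theta_old, y) - Q(theta_old, theta_old, y)
gain_ll = np.log(evidence(y, *theta)) - np.log(evidence(y, *theta_old))
if gain_Q >= 0:                          # accept theta only if it improves Q
    assert gain_ll >= gain_Q - 1e-12     # then the log-likelihood cannot decrease
```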