Mixture Model Parameter Estimation

Visser & Speekenbrink: Sec 2.3

Maximum Likelihood Estimator

The MLE is defined as the parameter values that maximize the likelihood function defined in the previous chapter, given the observations (treating the observations as fixed):

\begin{equation*}
\hat{\theta} = \arg\max_\theta L(\theta | y_{1:T})
\end{equation*}
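For reference, assuming the N-component mixture likelihood from the previous chapter (with mixing probabilities $\pi_i$ and component densities $f_i$), the function being maximized is

\begin{equation*}
L(\theta | y_{1:T}) = \prod_{t=1}^{T} \sum_{i=1}^{N} \pi_i f_i(y_t | \theta_{obs})
\end{equation*}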

Label Switching

In a mixture model, the components can be permuted and still give the same distribution (e.g., if there are two normal components, swapping the parameters of component 1 with those of component 2 leaves the likelihood unchanged). This has more severe consequences when using Bayesian parameter estimation and inference techniques, and when bootstrapping; in those cases, relabeling the components to make them consistent over iterations becomes necessary.
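As a minimal illustration of this invariance (my own example, not code from the book), swapping the two components' parameters and mixing weights leaves the mixture density, and hence the likelihood, unchanged:

# label switching sketch: the mixture density is identical when the two
# components' parameters (and mixing weights) are swapped
y <- c(-0.3, 0.1, 4.8, 5.2)                             # arbitrary observations
dens_a <- 0.4 * dnorm(y, 0, 1) + 0.6 * dnorm(y, 5, 2)   # labeling (1, 2)
dens_b <- 0.6 * dnorm(y, 5, 2) + 0.4 * dnorm(y, 0, 1)   # labels swapped
all.equal(dens_a, dens_b)                               # TRUE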

Method 1: Numerical Optimization of the Likelihood

Numerical optimization iteratively finds the minimum of a function. Consider the trivial example below:

GaussMix2 <- function(par, y){
  # set parameter values
  p1 <- par[1]; p2 <- 1 - par[1]
  m1 <- par[2]; m2 <- par[3]; s1 <- par[4]; s2 <- par[5]
  # return negative log-likelihood
  -sum(log(p1 * dnorm(y, mean = m1, sd = s1) + p2 * dnorm(y, mean = m2, sd = s2)))
}

# define starting values 
pstart <- c("p1"=.5,"m1"=1,"m2"=2,"s1"=.5,"s2"=.5)

# simulate observations from a two-component Gaussian mixture
y_obs <- c(rnorm(1000, 0, 1), rnorm(1000, 5, 2))

# run optim
opt <- optim(par = pstart, fn = GaussMix2, y = y_obs,
             lower = c(0, -Inf, -Inf, 0, 0), upper = c(1, rep(Inf, 4)),
             method = "L-BFGS-B")

# read output 
opt$par 

#        p1         m1         m2         s1         s2 
# 0.50854228 0.03480772 5.02649332 0.98800874 2.03568627 
  • GaussMix2 takes the parameter vector par and the observations y and returns the negative log-likelihood $-\sum_{t=1}^T \log\left(\pi_1 f_1(y_t) + \pi_2 f_2(y_t)\right)$, which optim then minimizes.

  • The arguments passed to optim are:

    • par = pstart: Initial values for the parameters to be optimized over

    • fn = GaussMix2: The function to be minimized; its first argument must be the vector of parameters over which minimization takes place

    • y = y_obs: an additional argument passed on to fn

    • The requirements that the mixing probability lies between 0 and 1 and that the standard deviations are non-negative are enforced through the lower and upper arguments (which require method = "L-BFGS-B")

Expectation Maximization (EM) Algorithm

EM Goal and Use Case:

EM targets parameter estimation for problems with missing or latent data, where maximum likelihood estimation would be easy if the missing or latent values were known.

EM Idea

Impute expected values for the latent data/states and then estimate the parameters by optimizing the joint likelihood of the parameters given the observations and the imputed latent states. We then alternate between the following two steps:

  • Expectation step: (Re)compute the expectation of the complete-data log-likelihood (the joint likelihood of the observed data and the hidden states).

  • Maximization step: The expected complete-data log-likelihood is a function of the parameters $\theta$ but not of the unobserved states $S$, so we can find the optimal values of $\theta$ by maximizing the expected complete-data log-likelihood.

EM in Mixture Models

The joint log-likelihood function (complete-data log-likelihood) can be decomposed into two parts:

\begin{equation*}
\log f(y_{1:T}, s_{1:T}|\theta) = \sum_{t=1}^T \log P(s_t|\theta_{pr}) + \sum_{t=1}^T \log f(y_t|s_t, \theta_{obs})
\end{equation*}
  • Compared to the model log-likelihood, the expression is the joint log-likelihood of data and the hidden states. The hidden states are normally unavailable in the data.

  • The first part is the probability of $S_t$ (the state at time $t$) given the mixing probability parameters $\theta_{pr}$. The second part is the probability of $Y_t$ (the observation at time $t$) given the state at that time and the density parameters $\theta_{obs}$.

The expected value of the complete-data log-likelihood, taken with respect to all possible hidden state sequences $s_{1:T}$ given the observations and the initial/previous guess parameters $\theta'$, is:

\begin{align*}
Q(\theta, \theta') & \equiv E_{s_{1:T}|y_{1:T}, \theta'}\left[ \log f(y_{1:T}, s_{1:T}|\theta)\right] \\
& = \sum_{s_{1:T}\in S^T} P(s_{1:T} | y_{1:T},\theta')\log f(y_{1:T}, s_{1:T}|\theta)\\
& = \sum_{s_{1:T}\in S^T} P(s_{1:T} | y_{1:T},\theta')\log \left(P(s_{1:T}|\theta_{pr}) \, f(y_{1:T}|s_{1:T},\theta_{obs})\right)\\
& = \sum_{t=1}^T \sum_{i=1}^N \gamma_t(i) \log P(s_t = i|\theta_{pr}) + \sum_{t=1}^T \sum_{i=1}^N \gamma_t(i)\log f(y_t | s_t = i, \theta_{obs}), \\
\gamma_t(i) & = p(s_t=i|y_t, \theta')
\end{align*}
  • The expectation is a function of both the parameters $\theta$ being optimized and the initial/previous guess parameters $\theta'$.

  • The first equal sign is applying the definition of expectation.

  • The second equal sign is applying the definition of complete joint log-likelihood.

  • Notice that $s_{1:T}\in S^T$ is just shorthand for all possible sequences of hidden states from time 1 to T.

  • The $\gamma$ function gives the posterior probabilities of the states given the observations and the initial/previous guess parameters (see the sketch after this list).
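As a concrete illustration of the $\gamma$ computation for the two-component Gaussian mixture used with GaussMix2 above (the helper name posterior_gamma is my own, not from the book), the posterior state probabilities follow from Bayes' rule:

# E-step sketch: posterior probability of each component for every observation,
# gamma_t(i) = pi_i * f_i(y_t) / (pi_1 * f_1(y_t) + pi_2 * f_2(y_t))
posterior_gamma <- function(par, y){
  p1 <- par[1]; p2 <- 1 - par[1]
  m1 <- par[2]; m2 <- par[3]; s1 <- par[4]; s2 <- par[5]
  num1 <- p1 * dnorm(y, mean = m1, sd = s1)
  num2 <- p2 * dnorm(y, mean = m2, sd = s2)
  cbind(num1, num2) / (num1 + num2)   # T x 2 matrix of gamma_t(i)
}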

So overall, the EM algorithm for the mixture model can be described as

  1. Start with an initial set of parameters $\theta'$

  2. Repeat until convergence:

    1. Compute the posterior probabilities $\gamma$ and the log-likelihood $\ell(\theta'|y_{1:T})$ as described in the previous chapter.

    2. Obtain new estimates $\hat{\theta}_{pr}$ and $\hat{\theta}_{obs}$ by maximizing the expected complete-data log-likelihood $Q(\theta, \theta')$

    3. Set $\theta'=(\hat{\theta}_{pr}, \hat{\theta}_{obs})$

Convergence can be checked by 1) the norm of the change in the parameter estimates or 2) the relative increase in the log-likelihood. In R, this can be done through the depmixS4 package.
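To make the loop concrete, here is a minimal EM sketch for the two-component Gaussian mixture used above. It is my own illustration (the function name em_gaussmix2 and the convergence rule are assumptions, not depmixS4 code), using the relative increase in the log-likelihood as the stopping criterion:

# EM sketch for a two-component Gaussian mixture; par = (p1, m1, m2, s1, s2)
em_gaussmix2 <- function(y, par, max_iter = 200, tol = 1e-8){
  ll_old <- -Inf
  for (iter in 1:max_iter) {
    # E-step: posterior state probabilities gamma_t(i) under the current par
    d1 <- par[1] * dnorm(y, par[2], par[4])
    d2 <- (1 - par[1]) * dnorm(y, par[3], par[5])
    gamma1 <- d1 / (d1 + d2)
    # M-step: gamma-weighted estimates of the mixing probability, means, and sds
    par[1] <- mean(gamma1)
    par[2] <- sum(gamma1 * y) / sum(gamma1)
    par[3] <- sum((1 - gamma1) * y) / sum(1 - gamma1)
    par[4] <- sqrt(sum(gamma1 * (y - par[2])^2) / sum(gamma1))
    par[5] <- sqrt(sum((1 - gamma1) * (y - par[3])^2) / sum(1 - gamma1))
    # convergence check: relative increase in the log-likelihood
    ll_new <- sum(log(d1 + d2))
    if (abs(ll_new - ll_old) < tol * abs(ll_old)) break
    ll_old <- ll_new
  }
  list(par = par, loglik = ll_new, iterations = iter)
}

em_gaussmix2(y_obs, par = c(p1 = .5, m1 = 1, m2 = 2, s1 = .5, s2 = .5))

Up to label switching, the estimates should be close to the optim solution above.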

In summary, the EM algorithm for mixture models alternates between

  1. computing the expected component membership probabilities or the posterior state probabilities

  2. optimizing the response model parameters conditional on these expected component memberships

Handling Constraints

Linear (in)equality constraints can be represented as

\begin{equation*}
I \leq A\theta \leq U
\end{equation*}

Numerical optimization under such constraints can be performed using packages such as Rsolnp, Rdonlp2, and depmixS4.
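As a hedged sketch of how such a constraint could be passed to Rsolnp::solnp (my own example: GaussMix2Full and prob_sum are assumed helper names, and both mixing probabilities are made explicit so that the equality constraint p1 + p2 = 1 applies):

library(Rsolnp)

# negative log-likelihood with both mixing probabilities as free parameters
GaussMix2Full <- function(par, y){
  -sum(log(par[1] * dnorm(y, par[3], par[5]) + par[2] * dnorm(y, par[4], par[6])))
}

# equality constraint: the mixing probabilities must sum to 1
# (solnp passes the extra argument y to this function as well, so it accepts it)
prob_sum <- function(par, y) par[1] + par[2]

fit <- solnp(pars = c(p1 = .5, p2 = .5, m1 = 1, m2 = 2, s1 = .5, s2 = .5),
             fun = GaussMix2Full, eqfun = prob_sum, eqB = 1,
             LB = c(0, 0, -100, -100, 1e-6, 1e-6),   # wide finite box constraints
             UB = c(1, 1,  100,  100,  100,  100),
             y = y_obs)
fit$pars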

Starting Values

We have two choices, particularly when using the EM algorithm.

  1. Start with density parameter values for the components (obtained through methods such as k-means) and equal values for the mixing probabilities (see the sketch after this list)

  2. Start with posterior probabilities (such as a random assignment of each case to a component, resulting in a one-hot vector of posterior probabilities) and go directly to the maximization step in the first iteration
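A brief sketch of choice 1 (my own illustration): use kmeans to obtain starting values for the component means and standard deviations, and set the mixing probability to an equal split:

# choice 1 sketch: k-means centers/spreads as density starting values,
# equal mixing probabilities
km <- kmeans(y_obs, centers = 2)
pstart_km <- c(p1 = 0.5,
               m1 = km$centers[1], m2 = km$centers[2],
               s1 = sd(y_obs[km$cluster == 1]), s2 = sd(y_obs[km$cluster == 2]))
pstart_km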
