Mixture Model Parameter Estimation

Visser & Speekenbrink: Sec 2.3

Maximum Likelihood Estimator

The MLE is defined as the parameter values that maximize the likelihood function defined in the previous chapter, given the observations (treating the observations as fixed):

\begin{equation*}
\hat{\theta} = \arg\max_\theta L(\theta | y_{1:T})
\end{equation*}
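For reference, assuming the N-component mixture likelihood from the previous chapter (with mixing probabilities $\pi_i$ and component densities $f_i$), the function being maximized is

\begin{equation*}
L(\theta | y_{1:T}) = \prod_{t=1}^{T} \sum_{i=1}^{N} \pi_i f_i(y_t | \theta_{obs})
\end{equation*}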

Label Switching

In a mixture model, the components can be permuted and still give the same distribution (e.g., if there are two normal components, swapping the parameters of component 1 with those of component 2 leaves the likelihood unchanged). This has more severe consequences when using Bayesian parameter estimation and inference techniques, and when bootstrapping; in those cases, relabeling the components to make them consistent over iterations becomes necessary.
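As a minimal illustration of this invariance (my own example, not code from the book), swapping the two components' parameters and mixing weights leaves the mixture density, and hence the likelihood, unchanged:

# label switching sketch: the mixture density is identical when the two
# components' parameters (and mixing weights) are swapped
y <- c(-0.3, 0.1, 4.8, 5.2)                             # arbitrary observations
dens_a <- 0.4 * dnorm(y, 0, 1) + 0.6 * dnorm(y, 5, 2)   # labeling (1, 2)
dens_b <- 0.6 * dnorm(y, 5, 2) + 0.4 * dnorm(y, 0, 1)   # labels swapped
all.equal(dens_a, dens_b)                               # TRUE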

Method 1: Numerical Optimization of the Likelihood

Numerical optimization iteratively finds the minimum of a function. Consider the trivial example below:

GaussMix2 <- function(par, y){
  # set parameter values
  p1 <- par[1]; p2 <- 1 - par[1]
  m1 <- par[2]; m2 <- par[3]; s1 <- par[4]; s2 <- par[5]
  # return negative log-likelihood
  -sum(log(p1 * dnorm(y, mean = m1, sd = s1) + p2 * dnorm(y, mean = m2, sd = s2)))
}

# define starting values 
pstart <- c("p1"=.5,"m1"=1,"m2"=2,"s1"=.5,"s2"=.5)

# simulate observations from a two-component Gaussian mixture
y_obs <- c(rnorm(1000, 0, 1), rnorm(1000, 5, 2))

# run optim
opt <- optim(par = pstart, fn = GaussMix2, y = y_obs,
             lower = c(0, -Inf, -Inf, 0, 0), upper = c(1, rep(Inf, 4)),
             method = "L-BFGS-B")

# read output 
opt$par 

#        p1         m1         m2         s1         s2 
# 0.50854228 0.03480772 5.02649332 0.98800874 2.03568627 
  • GaussMix2 takes the parameter vector par and the observations y and returns the negative log-likelihood $-\sum_{t=1}^T \log\left(\pi_1 f_1(y_t) + \pi_2 f_2(y_t)\right)$, which optim then minimizes.

  • The arguments passed to optim are:

    • par = pstart: Initial values for the parameters to be optimized over

    • fn = GaussMix2: The function to be minimized; its first argument must be the vector of parameters over which minimization takes place

    • y = y_obs: an additional argument passed on to fn

    • The requirements that the mixing probability lies between 0 and 1 and that the standard deviations are non-negative are enforced through the lower and upper arguments (which require method = "L-BFGS-B")

Expectation Maximization (EM) Algorithm

EM Goal and Use Case:

EM targets parameter estimation for problems with missing or latent data, where maximum likelihood estimation would be easy if the missing or latent values were known.

EM Idea

Impute expected values for the latent data/states and then estimate the parameters by optimizing the joint likelihood of the parameters given the observations and the imputed latent states. We then alternate between the following two steps:

  • Expectation step: (Re)compute the expectation of the complete-data log-likelihood (the joint likelihood of the observed data and the hidden states).

  • Maximization step: The expected complete-data log-likelihood is a function of the parameters $\theta$ but not of the unobserved states $S$, so we can find the optimal values of $\theta$ by maximizing the expected complete-data log-likelihood.

EM in Mixture Models

The joint log-likelihood function (complete-data log-likelihood) can be decomposed into two parts:

\begin{equation*}
\log f(y_{1:T}, s_{1:T}|\theta) = \sum_{t=1}^T \log P(s_t|\theta_{pr}) + \sum_{t=1}^T \log f(y_t|s_t, \theta_{obs})
\end{equation*}
  • Compared to the model log-likelihood, the expression is the joint log-likelihood of data and the hidden states. The hidden states are normally unavailable in the data.

  • The first part is the probability of $S_t$ (the state at time $t$) given the mixing probability parameters $\theta_{pr}$. The second part is the probability of $Y_t$ (the observation at time $t$) given the state at that time and the density parameters $\theta_{obs}$.

The expected value of the complete-data log-likelihood, taken with respect to all possible hidden state sequences $s_{1:T}$ given the observations and the initial/previous guess parameters $\theta'$, is:

\begin{align*}
Q(\theta, \theta') & \equiv E_{s_{1:T}|y_{1:T}, \theta'}\left[ \log f(y_{1:T}, s_{1:T}|\theta)\right] \\
& = \sum_{s_{1:T}\in S^T} P(s_{1:T} | y_{1:T},\theta')\log f(y_{1:T}, s_{1:T}|\theta)\\
& = \sum_{s_{1:T}\in S^T} P(s_{1:T} | y_{1:T},\theta')\log \left(P(s_{1:T}|\theta_{pr}) \, f(y_{1:T}|s_{1:T},\theta_{obs})\right)\\
& = \sum_{t=1}^T \sum_{i=1}^N \gamma_t(i) \log P(s_t = i|\theta_{pr}) + \sum_{t=1}^T \sum_{i=1}^N \gamma_t(i)\log f(y_t | s_t = i, \theta_{obs}), \\
\gamma_t(i) & = p(s_t=i|y_t, \theta')
\end{align*}
  • The expectation is a function of both the parameters $\theta$ being optimized and the initial/previous guess parameters $\theta'$.

  • The first equal sign is applying the definition of expectation.

  • The second equal sign is applying the definition of complete joint log-likelihood.

  • Notice that $s_{1:T}\in S^T$ is just shorthand for all possible sequences of hidden states from time 1 to T.

  • The $\gamma$ function gives the posterior probabilities of the states given the observations and the initial/previous guess parameters (see the sketch after this list).
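As a concrete illustration of the $\gamma$ computation for the two-component Gaussian mixture used with GaussMix2 above (the helper name posterior_gamma is my own, not from the book), the posterior state probabilities follow from Bayes' rule:

# E-step sketch: posterior probability of each component for every observation,
# gamma_t(i) = pi_i * f_i(y_t) / (pi_1 * f_1(y_t) + pi_2 * f_2(y_t))
posterior_gamma <- function(par, y){
  p1 <- par[1]; p2 <- 1 - par[1]
  m1 <- par[2]; m2 <- par[3]; s1 <- par[4]; s2 <- par[5]
  num1 <- p1 * dnorm(y, mean = m1, sd = s1)
  num2 <- p2 * dnorm(y, mean = m2, sd = s2)
  cbind(num1, num2) / (num1 + num2)   # T x 2 matrix of gamma_t(i)
}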

So overall, the EM algorithm for the mixture model can be described as

  1. Start with an initial set of parameters $\theta'$

  2. Repeat until convergence:

    1. Compute the posterior probabilities $\gamma$ and the log-likelihood $\ell(\theta'|y_{1:T})$ as described in the previous chapter.

    2. Obtain new estimates $\hat{\theta}_{pr}$ and $\hat{\theta}_{obs}$ by maximizing the expected complete-data log-likelihood $Q(\theta, \theta')$

    3. Set $\theta'=(\hat{\theta}_{pr}, \hat{\theta}_{obs})$

Convergence can be checked by 1) the norm of the change in the parameter estimates or 2) the relative increase in the log-likelihood. In R, this can be done through the depmixS4 package.
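To make the loop concrete, here is a minimal EM sketch for the two-component Gaussian mixture used above. It is my own illustration (the function name em_gaussmix2 and the convergence rule are assumptions, not depmixS4 code), using the relative increase in the log-likelihood as the stopping criterion:

# EM sketch for a two-component Gaussian mixture; par = (p1, m1, m2, s1, s2)
em_gaussmix2 <- function(y, par, max_iter = 200, tol = 1e-8){
  ll_old <- -Inf
  for (iter in 1:max_iter) {
    # E-step: posterior state probabilities gamma_t(i) under the current par
    d1 <- par[1] * dnorm(y, par[2], par[4])
    d2 <- (1 - par[1]) * dnorm(y, par[3], par[5])
    gamma1 <- d1 / (d1 + d2)
    # M-step: gamma-weighted estimates of the mixing probability, means, and sds
    par[1] <- mean(gamma1)
    par[2] <- sum(gamma1 * y) / sum(gamma1)
    par[3] <- sum((1 - gamma1) * y) / sum(1 - gamma1)
    par[4] <- sqrt(sum(gamma1 * (y - par[2])^2) / sum(gamma1))
    par[5] <- sqrt(sum((1 - gamma1) * (y - par[3])^2) / sum(1 - gamma1))
    # convergence check: relative increase in the log-likelihood
    ll_new <- sum(log(d1 + d2))
    if (abs(ll_new - ll_old) < tol * abs(ll_old)) break
    ll_old <- ll_new
  }
  list(par = par, loglik = ll_new, iterations = iter)
}

em_gaussmix2(y_obs, par = c(p1 = .5, m1 = 1, m2 = 2, s1 = .5, s2 = .5))

Up to label switching, the estimates should be close to the optim solution above.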

In summary, the EM algorithm for mixture models alternates between

  1. computing the expected component membership probabilities or the posterior state probabilities

  2. optimizing the response model parameters conditional on these expected component memberships

Handling Constraints

Linear (in)equality constraints can be represented as

\begin{equation*}
I \leq A\theta \leq U
\end{equation*}

Numerical optimization under such constraints can be performed using packages such as Rsolnp, Rdonlp2, and depmixS4.
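As a hedged sketch of how such a constraint could be passed to Rsolnp::solnp (my own example: GaussMix2Full and prob_sum are assumed helper names, and both mixing probabilities are made explicit so that the equality constraint p1 + p2 = 1 applies):

library(Rsolnp)

# negative log-likelihood with both mixing probabilities as free parameters
GaussMix2Full <- function(par, y){
  -sum(log(par[1] * dnorm(y, par[3], par[5]) + par[2] * dnorm(y, par[4], par[6])))
}

# equality constraint: the mixing probabilities must sum to 1
# (solnp passes the extra argument y to this function as well, so it accepts it)
prob_sum <- function(par, y) par[1] + par[2]

fit <- solnp(pars = c(p1 = .5, p2 = .5, m1 = 1, m2 = 2, s1 = .5, s2 = .5),
             fun = GaussMix2Full, eqfun = prob_sum, eqB = 1,
             LB = c(0, 0, -100, -100, 1e-6, 1e-6),   # wide finite box constraints
             UB = c(1, 1,  100,  100,  100,  100),
             y = y_obs)
fit$pars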

Starting Values

We have two choices, particularly when using the EM algorithm.

  1. Start with density parameter values for the components (obtained through methods such as k-means) and equal values for the mixing probabilities (see the sketch after this list)

  2. Start with posterior probabilities (such as a random assignment of each case to a component, resulting in a one-hot vector of posterior probabilities) and go directly to the maximization step in the first iteration
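A brief sketch of choice 1 (my own illustration): use kmeans to obtain starting values for the component means and standard deviations, and set the mixing probability to an equal split:

# choice 1 sketch: k-means centers/spreads as density starting values,
# equal mixing probabilities
km <- kmeans(y_obs, centers = 2)
pstart_km <- c(p1 = 0.5,
               m1 = km$centers[1], m2 = km$centers[2],
               s1 = sd(y_obs[km$cluster == 1]), s2 = sd(y_obs[km$cluster == 2]))
pstart_km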
