EM Algorithm and Mixtures of Generalized Linear Models

Generalized Linear Models (GLM)

Components of GLM

GLM assumes the observations follow an exponential family distribution

\begin{equation*} f(y_t) = \exp \left(w_t \frac{y_t\theta_t - b(\theta_t)}{\phi} + c(y_t, \phi / w_t)\right) \end{equation*}
  • $\theta_t, \phi$: location and scale parameters

    • Often $\phi$ is a known constant; for the Binomial and Poisson distributions it equals 1. If it is unknown, it is estimated by the method of moments. Note that $\theta_t$ and $\phi$ are orthogonal parameters, so their estimation does not affect each other.

  • $w_t$: known weights

  • $b, c$: known functions

The mean and variance are

\begin{align*} & E[y_t] = \mu_t = b'(\theta_t) & Var(y_t) = \frac{\phi}{w_t}b''(\theta_t) \end{align*}

The distribution can also depend on predictors $x_j$ through a linear function, such that the mean is assumed to be a smooth function of a linear combination of the predictors

\begin{align*} & \mu_t = g(\eta_t) = g(\beta_0 + \beta_1x_{1t} + \cdots + \beta_k x_{kt}) \\ & \eta_t = g^{-1}(\mu_t) = h(\mu_t) \end{align*}
  • $h$ is also called the link function

  • If $h=(b')^{-1}$, we call this the canonical link function, and then $\theta_t = \eta_t$
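As a quick sanity check (a minimal sketch, not from the notes): for the Bernoulli family, $b(\theta) = \log(1+e^\theta)$, so the mean function $b'(\theta)$ is the sigmoid, and its inverse, the logit, is the canonical link:

```python
import math

# Bernoulli in exponential-family form (phi = 1, w_t = 1):
# b(theta) = log(1 + e^theta), so mu = b'(theta) = 1 / (1 + e^{-theta}).
def b_prime(theta):
    return 1.0 / (1.0 + math.exp(-theta))

def logit(mu):
    # candidate canonical link h = (b')^{-1}
    return math.log(mu / (1.0 - mu))

# h = (b')^{-1} means theta_t = h(mu_t) = eta_t: applying the logit to the
# mean recovers the natural parameter.
for theta in [-2.0, -0.5, 0.0, 1.3]:
    assert abs(logit(b_prime(theta)) - theta) < 1e-9
```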

Model Likelihood

The log-likelihood for $T$ independent observations from an exponential family distribution is

\begin{equation*} l(\theta_{1:T}, \phi | y_{1:T}) = \sum_{t=1}^T w_t \frac{y_t\theta_t - b(\theta_t)}{\phi} + \sum_{t=1}^Tc(y_t, \phi/w_t) \end{equation*}
  • Note again that the likelihood is not a pdf: it is a function of the parameters, with the data held fixed.
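As a concrete case (a small check, not from the notes): the Poisson log-likelihood written in exponential-family form, with $\theta = \log\lambda$, $b(\theta) = e^\theta$, $\phi = w_t = 1$, and $c(y, \phi) = -\log y!$, matches the log-pmf summed directly:

```python
import math

# Poisson in exponential-family form: theta = log(lambda), b(theta) = e^theta,
# phi = 1, w_t = 1, c(y, phi) = -log(y!).
def exp_family_loglik(ys, lam):
    theta = math.log(lam)
    return sum(y * theta - math.exp(theta) - math.lgamma(y + 1) for y in ys)

def direct_loglik(ys, lam):
    # sum of log pmf: y*log(lam) - lam - log(y!)
    return sum(y * math.log(lam) - lam - math.lgamma(y + 1) for y in ys)

ys = [0, 2, 3, 1, 5]
assert abs(exp_family_loglik(ys, 2.5) - direct_loglik(ys, 2.5)) < 1e-9
```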

Estimate GLM Through Iteratively Weighted Least Squares

GLM beta coefficients can be estimated through Iteratively Weighted Least Squares (IWLS), and it can be shown that this is equivalent to maximum likelihood estimation.

Start with an initial guess $\hat{\eta}_t$ and set $\hat{\mu}_t = g(\hat{\eta}_t)$

  1. Obtain the working response $z_t = \hat{\eta}_t + \frac{y_t - \hat{\mu}_t}{\partial\mu_t/\partial\eta_t}$

  2. Obtain the working weight $W_t = \frac{w_t}{b''(\hat{\theta}_t)}\left(\frac{\partial\mu_t}{\partial\eta_t}\right)^2$

  3. Estimate $\hat{\beta} = (X^TWX)^{-1}X^TWz$

    • $X$: model matrix

    • $W$: diagonal matrix of the working weights

    • $z$: vector of working responses

  4. Obtain the new estimate $\hat{\eta}=X\hat{\beta}$

Repeat the steps until convergence
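The loop above can be sketched for Poisson regression with the canonical log link, where $b''(\hat{\theta}_t) = \hat{\mu}_t$ and $\partial\mu_t/\partial\eta_t = \hat{\mu}_t$, so the working weight reduces to $W_t = \hat{\mu}_t$. This is a minimal sketch on synthetic data; the function name is illustrative:

```python
import numpy as np

def iwls_poisson(X, y, n_iter=25):
    """IWLS for Poisson regression with the canonical log link (phi = 1, w_t = 1).

    With this family b''(theta) = mu and d(mu)/d(eta) = mu, so the working
    weight simplifies to W_t = mu_t."""
    beta = np.zeros(X.shape[1])          # initial guess: eta_t = 0
    for _ in range(n_iter):
        eta = X @ beta
        mu = np.exp(eta)                 # mu = g(eta) = exp(eta)
        z = eta + (y - mu) / mu          # working response
        W = mu                           # working weights (diagonal of W)
        # beta = (X^T W X)^{-1} X^T W z, solved without forming the inverse
        XtW = X.T * W
        beta = np.linalg.solve(XtW @ X, XtW @ z)
    return beta

# tiny synthetic example (hypothetical data)
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
y = rng.poisson(np.exp(0.5 + 0.8 * X[:, 1]))
beta_hat = iwls_poisson(X, y)
# at the MLE with a canonical link, the score X^T (y - mu) vanishes
assert np.allclose(X.T @ (y - np.exp(X @ beta_hat)), 0.0, atol=1e-6)
```

Solving the normal equations with `np.linalg.solve` rather than inverting $X^TWX$ is the standard numerically stable choice.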

Connection to EM Algorithm and Mixture Model

Summary: if each component density of the mixture model belongs to the GLM family, we can maximize each component's weighted likelihood numerically with IWLS.

Recall that in a mixture model we can define the joint log-likelihood as a function of the data, the hidden states, and the parameters. The expectation of the complete-data log-likelihood over all possible hidden states is

\begin{align*} Q(\theta, \theta') & =\sum_{t=1}^T \sum_{i=1}^N \gamma_t(i) \log P(s_t = i|\theta_{pr}) + \sum_{t=1}^T \sum_{i=1}^N \gamma_t(i)\log f(y_t | s_t = i, \theta_{obs}) \\ \gamma_t(i) & = p(s_t=i|y_t, \theta') \end{align*}

Part of the EM process is then to obtain the parameter estimates that maximize this expected complete-data log-likelihood.

\begin{equation*} \hat{\theta}_{obs} = \underset{\theta_{obs}}{\text{argmax}} \sum_{t=1}^T \sum_{i=1}^N \gamma_t(i)\log f(y_t|s_t = i,\theta_{obs}) \end{equation*}

If each component density has a separate set of parameters, then they can be estimated separately. So the parameters for component $i$ can be estimated by maximizing the weighted likelihood:

\begin{equation*} \sum_{t=1}^T \gamma_t(i)\log f_i(y_t|s_t = i,\theta_{obs}^{(i)}) \end{equation*}

Notice the tricky part of EM for mixture models: it is not always possible to obtain the maximum analytically. But if the density function for component $i$ belongs to the generalized linear model family, we can rewrite the formula as

\begin{align*} & \sum_{t=1}^T w'_t \frac{y_t\theta_t - b(\theta_t)}{\phi} + \sum_{t=1}^T\gamma_t(i)c(y_t, \phi/w_t)\\ & w'_t =\gamma_t(i)w_t \end{align*}

Note that only the first term affects the estimation of $\beta$, and hence $\theta$, so we can run the same IWLS routine using the weights $w'_t$ in place of $w_t$.

Further note that we cannot use this IWLS fit to estimate $\phi$, since the weights differ ($w'_t$ vs. $w_t$). But $\phi$ is generally estimated through a method of moments anyway.
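Putting the pieces together, here is a minimal EM sketch for a two-component mixture of Poisson regressions on synthetic data (function names and starting values are illustrative): the E-step computes the responsibilities $\gamma_t(i)$, and the M-step refits each component by weighted IWLS with $w'_t = \gamma_t(i)$ (taking $w_t = 1$):

```python
import numpy as np

def iwls_weighted(X, y, w, n_iter=25):
    """Poisson log-link IWLS with prior weights w'_t = gamma_t(i) * w_t
    (here w_t = 1, so w'_t = gamma_t(i))."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta
        mu = np.exp(eta)
        z = eta + (y - mu) / mu              # working response
        XtW = X.T * (w * mu)                 # prior weight times working weight mu_t
        beta = np.linalg.solve(XtW @ X, XtW @ z)
    return beta

def em_step(X, y, betas, pis):
    # E-step: gamma_t(i) is proportional to pi_i * f_i(y_t); the log(y_t!)
    # term is common to all components and cancels when normalizing.
    logp = np.column_stack([np.log(p) + y * (X @ b) - np.exp(X @ b)
                            for p, b in zip(pis, betas)])
    logp -= logp.max(axis=1, keepdims=True)  # numerical stability
    gamma = np.exp(logp)
    gamma /= gamma.sum(axis=1, keepdims=True)
    # M-step: each component is a weighted GLM fit with weights gamma_t(i)
    new_betas = [iwls_weighted(X, y, gamma[:, i]) for i in range(len(betas))]
    return new_betas, gamma.mean(axis=0), gamma

# tiny synthetic two-component example (hypothetical data)
rng = np.random.default_rng(1)
n = 300
X = np.column_stack([np.ones(n), rng.normal(size=n)])
comp = rng.integers(0, 2, size=n)
eta_true = np.where(comp == 0, 0.2 + 1.0 * X[:, 1], 2.0 - 0.5 * X[:, 1])
y = rng.poisson(np.exp(eta_true))
betas = [np.zeros(2), np.array([1.0, 0.0])]  # perturbed start to break symmetry
pis = np.array([0.5, 0.5])
for _ in range(20):
    betas, pis, gamma = em_step(X, y, betas, pis)
```

Note that the mixing proportions are updated as the average responsibility per component, while each $\beta^{(i)}$ comes from the weighted IWLS fit, mirroring the decomposition above.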
