Information Criteria

Akaike Information Criterion

AIC Summary

The short version (lower AIC is better):

\begin{equation*} AIC = -2\, l(\hat{\theta}|y_{1:T}) + 2k + \frac{2k(k+1)}{T-k-1} \end{equation*}

AIC Derivation

The long version: a model should be selected to maximize the expected log likelihood with respect to the true distribution. This is equivalent to minimizing the KL divergence between the true distribution and the approximating model.

\begin{equation*} D(f^*\|f) = \int f^*(y) \log \frac{f^*(y)}{f(y|\theta)}\, dy = E_{f^*}[\log f^*(Y)] - E_{f^*}[\log f(Y|\theta)] \end{equation*}

Notice the first term is a constant, so the goal is to maximize the second term. The expected log likelihood under the true distribution is estimated through the observed log likelihood:
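The decomposition above can be checked numerically. The following sketch (an illustrative example of my own, not from the text) takes $f^* = N(0,1)$ and $f(\cdot|\theta) = N(0.5,1)$ and compares a Monte Carlo estimate of the KL divergence against the closed form $\mu^2/2$ for two unit-variance normals:

```python
import math
import random

# Monte Carlo check of the KL decomposition (illustrative toy example):
# f* = N(0, 1), f(.|theta) = N(0.5, 1).
random.seed(0)

def log_n(y, mu):
    """Log density of N(mu, 1)."""
    return -0.5 * math.log(2 * math.pi) - 0.5 * (y - mu) ** 2

# E_{f*}[log f*(Y)] - E_{f*}[log f(Y|theta)], estimated by sampling from f*
ys = [random.gauss(0.0, 1.0) for _ in range(200_000)]
kl_mc = sum(log_n(y, 0.0) - log_n(y, 0.5) for y in ys) / len(ys)

# Closed form: KL(N(0,1) || N(mu,1)) = mu^2 / 2
kl_exact = 0.5 * 0.5 ** 2

print(kl_mc, kl_exact)  # both close to 0.125
```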

\begin{equation*} E_{f^*}[\log f(Y|\theta)] \approx \frac{1}{T} \sum_{t=1}^T \log f(y_t|\theta) \end{equation*}

But this estimate is biased, because the same data are used to estimate both the parameters and the expectation. The bias is approximately $k/T$, where $k$ is the number of freely estimated parameters.

So AIC is the bias-corrected average log likelihood estimate with a multiplier of $-2T$:

\begin{equation*} AIC = -2 T \cdot \frac{1}{T} \sum_{t=1}^T \log f(y_t|\hat{\theta}) + 2k = -2\, l(\hat{\theta}|y_{1:T}) + 2k \end{equation*}

If the number of observations is small compared to the number of estimated parameters, we add a second-order bias correction term:

\begin{equation*} AIC = -2\, l(\hat{\theta}|y_{1:T}) + 2k + \frac{2k(k+1)}{T-k-1} \end{equation*}
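The two formulas can be wrapped in a small helper (the function name and signature are my own, not from the text):

```python
import math

def aic(loglik, k, T=None, corrected=False):
    """AIC = -2*loglik + 2k, optionally with the second-order
    small-sample correction term 2k(k+1)/(T-k-1) (AICc)."""
    value = -2.0 * loglik + 2.0 * k
    if corrected:
        value += 2.0 * k * (k + 1) / (T - k - 1)
    return value

# Example: log likelihood -100 with 3 free parameters
print(aic(-100.0, 3))                         # plain AIC: 206.0
print(aic(-100.0, 3, T=20, corrected=True))   # AICc: 206 + 24/16 = 207.5
```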

AIC is asymptotically efficient: if the true model is not in the set of models under consideration, AIC asymptotically selects the model that minimizes the mean squared error of prediction. AIC is not consistent.

Bayesian Information Criterion

BIC Summary

\begin{align*} BIC = -2 \log f(y_{1:T}|\hat{\theta}, M) + k \log T \end{align*}

BIC Derivation

BIC aims to select the model with the highest posterior probability. Let $M_i = \{ f(Y|\theta_i) : \theta_i \in \Theta_i \}$ be a candidate model family. We focus on the posterior model probability.

\begin{align*} \log P(M_i|y_{1:T}) & = \log \left[ f(y_{1:T}|M_i)\, P(M_i) / p(y_{1:T}) \right] \\ & \propto \log \left[ P(M_i)\, f(y_{1:T}|M_i) \right] \\ & = \log \left[ P(M_i) \int f(y_{1:T}|\theta_i, M_i)\, f(\theta_i|M_i)\, d\theta_i \right] \\ & = \log P(M_i) + \log \int f(y_{1:T}|\theta_i, M_i)\, f(\theta_i|M_i)\, d\theta_i \end{align*}

Now if we have no prior information, we can assume that all models have the same prior probability, so the key term is the marginal (integrated) likelihood:

\begin{equation*} \log f(y_{1:T}|M_i) = \log \int f(y_{1:T}|\theta_i, M_i)\, f(\theta_i|M_i)\, d\theta_i \end{equation*}

The following notes are based on Adrian E. Raftery's 1995 paper on Bayesian model selection, approximating a single model $M$ evaluated at $\theta^*$.

\begin{align*} g(\theta) & = \log \left[ p(y_{1:T}|\theta, M)\, p(\theta|M) \right] \\ & = g(\theta^*) + (\theta - \theta^*)^T g'(\theta^*) + \frac{1}{2}(\theta - \theta^*)^T g''(\theta^*)(\theta - \theta^*) + o(\|\theta - \theta^*\|^2) \\ & \approx g(\theta^*) + \frac{1}{2}(\theta - \theta^*)^T g''(\theta^*)(\theta - \theta^*) \end{align*}
  • Here $g'$ is the vector of first partial derivatives and $g''$ is the Hessian matrix

  • $g'(\theta^*) = 0$ since $\theta^*$ is chosen to maximize $g(\theta)$

  • From now on we omit the conditioning on $M$, but it is still implied

We can then use this expansion to approximate the marginal likelihood:

\begin{align*} p(y_{1:T}|M) & = \int \exp g(\theta)\, d\theta \\ & \approx \exp(g(\theta^*)) \int \exp\left( \frac{1}{2}(\theta - \theta^*)^T g''(\theta^*)(\theta - \theta^*) \right) d\theta \\ & \approx \exp(g(\theta^*)) \left( (2\pi)^{d/2} |-g''(\theta^*)|^{-\frac{1}{2}} + O(T^{-1}) \right) \end{align*}
  • The integral is approximated by a multivariate normal density (Laplace's method for integrals)
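As a sanity check on Laplace's method, the following toy 1-D example (my own construction, not from the text) takes $g(\theta) = T(\log\theta - \theta)$, maximized at $\theta^* = 1$ with $g''(\theta^*) = -T$, whose exact integral over $(0, \infty)$ is $\Gamma(T+1)/T^{T+1}$:

```python
import math

# Toy 1-D check of the Laplace approximation (illustrative example).
# g(theta) = T*(log(theta) - theta), so theta* = 1, g(theta*) = -T,
# and g''(theta*) = -T.  Exact: integral exp(g) d(theta) = Gamma(T+1)/T^(T+1).
T = 50

# Exact log integral: log Gamma(T+1) - (T+1) log T  (lgamma avoids overflow)
exact_log = math.lgamma(T + 1) - (T + 1) * math.log(T)

# Laplace approximation on the log scale:
# log[ exp(g(theta*)) * (2*pi)^(1/2) * |-g''(theta*)|^(-1/2) ]
laplace_log = -T + 0.5 * math.log(2 * math.pi / T)

print(exact_log, laplace_log)  # nearly equal; error shrinks as T grows
```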

Taking the log, we get

\begin{align*} \log p(y_{1:T}|M) & = \log \left[ \exp(g(\theta^*)) \left( (2\pi)^{d/2} |-g''(\theta^*)|^{-\frac{1}{2}} + O(T^{-1}) \right) \right] \\ & = g(\theta^*) + \frac{d}{2} \log(2\pi) - \frac{1}{2}\log |-g''(\theta^*)| + O(T^{-1}) \\ & = \log p(y_{1:T}|\theta^*, M) + \log p(\theta^*|M) + \frac{d}{2} \log(2\pi) - \frac{1}{2}\log |-g''(\theta^*)| + O(T^{-1}) \\ & \approx \log p(y_{1:T}|\theta^*, M) + \log p(\theta^*|M) + \frac{d}{2} \log(2\pi) - \frac{d}{2}\log T - \frac{1}{2}\log |I| + O(T^{-\frac{1}{2}}) \\ & \approx \log p(y_{1:T}|\theta^*, M) - \frac{d}{2}\log T + O(1) \end{align*}
  • In large samples, $\theta^*$ is approximately equal to the MLE, so $-g''(\theta^*) \approx T I$, where $I$ is the expected Fisher information matrix for one observation (the expectation taken over the data, with the parameter held fixed). Hence $|T I| \approx T^d |I|$. These two approximations contribute the $O(T^{-\frac{1}{2}})$ error term

  • The last step drops all terms that do not scale with the number of observations

  • $d$ is the number of free parameters

The convention is to take the above estimate and multiply by minus two (writing $k = d$ for the number of free parameters), hence

\begin{align*} BIC = -2 \log f(y_{1:T}|\hat{\theta}, M) + k \log T \end{align*}

We can then obtain approximate posterior model probabilities:

\begin{align*} P(M_i|y_{1:T}) \approx \frac{\exp(-\frac{1}{2} BIC_i)\, P(M_i)}{\sum_{j=1}^N \exp(-\frac{1}{2} BIC_j)\, P(M_j)} \end{align*}

If we assume all models have the same prior, we can drop the model prior probabilities. The resulting relative model probabilities are called BIC weights.
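A minimal sketch of BIC weights under equal model priors (the helper names are assumed, not from the text); subtracting the minimum BIC before exponentiating keeps the computation numerically stable:

```python
import math

def bic(loglik, k, T):
    """BIC = -2*loglik + k*log(T)."""
    return -2.0 * loglik + k * math.log(T)

def bic_weights(bics):
    """Relative posterior model probabilities under equal model priors."""
    best = min(bics)  # shift by the minimum so exp() does not underflow
    raw = [math.exp(-0.5 * (b - best)) for b in bics]
    total = sum(raw)
    return [r / total for r in raw]

# Example: two candidate models fitted to the same T = 50 observations
weights = bic_weights([bic(-100.0, 2, 50), bic(-98.0, 4, 50)])
print(weights)  # sums to 1; the lower-BIC model gets the larger weight
```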

Notice that in the previous estimate we ignored the term $\log p(\theta^*|M)$. We can refine the criterion by assuming a unit-information prior, i.e., a multivariate normal prior with mean $\hat{\theta}$ and covariance matrix $I(\hat{\theta})^{-1}$. The result turns out to be the same approximation.

BIC is asymptotically consistent: as the number of observations grows, it will choose the true model (if the true model is under consideration) with probability approaching one. BIC is not asymptotically efficient.

Citation

  1. Raftery, A. E. (1995). Bayesian Model Selection in Social Research. Sociological Methodology, 25, 111–163. https://doi.org/10.2307/271063

  2. Visser, I., & Maarten Speekenbrink. (2022). Mixture and Hidden Markov Models with R. Springer Nature.
