The long version: a model should be selected so that it maximizes the expected log likelihood with respect to the true distribution $f^*$. This is equivalent to minimizing the KL divergence between the true distribution and the approximating model:
$$\mathrm{KL}(f^* \,\|\, f_\theta) = E_{f^*}[\log f^*(Y)] - E_{f^*}[\log f(Y \mid \theta)]$$
Notice that the first term is constant: it does not depend on the model. So the goal is to maximize the second term. The expectation of the log likelihood under the true distribution is estimated by the observed average log likelihood:
$$E_{f^*}[\log f(Y \mid \theta)] \approx \frac{1}{T} \sum_{t=1}^{T} \log f(y_t \mid \theta)$$
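But this estimate is biased, because the same data are used both to estimate the parameters and to approximate the expectation. When $\theta$ is replaced by the maximum likelihood estimate $\hat\theta$, the observed average overestimates the expected log likelihood by approximately $k/T$, where $k$ is the number of freely estimated parameters.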
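The size of this bias can be checked with a small simulation. The sketch below is only an illustration under stated assumptions: the true distribution is taken to be N(0, 1), the fitted model is a Gaussian with unknown mean and variance (so k = 2), and the function name `simulate_bias` and its parameters are made up for this example.

```python
import math
import random


def simulate_bias(T: int = 50, reps: int = 10000, seed: int = 0) -> float:
    """Average gap between the in-sample average log likelihood at the MLE
    and the expected log likelihood of the fitted model under the truth."""
    rng = random.Random(seed)
    total_gap = 0.0
    for _ in range(reps):
        # Assumed true distribution: N(0, 1).
        y = [rng.gauss(0.0, 1.0) for _ in range(T)]
        mu_hat = sum(y) / T
        var_hat = sum((v - mu_hat) ** 2 for v in y) / T  # MLE of the variance
        # In-sample average log likelihood evaluated at the MLE.
        in_sample = -0.5 * math.log(2 * math.pi * var_hat) - 0.5
        # Expected log likelihood of the fitted model under N(0, 1),
        # using E[(Y - mu_hat)^2] = 1 + mu_hat^2 for Y ~ N(0, 1).
        expected = (-0.5 * math.log(2 * math.pi * var_hat)
                    - (1.0 + mu_hat ** 2) / (2.0 * var_hat))
        total_gap += in_sample - expected
    return total_gap / reps


bias = simulate_bias()
print(f"simulated bias: {bias:.4f}   k/T: {2 / 50:.4f}")
```

The simulated gap should be positive and close to k/T; the match is only approximate because k/T is a first-order result.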
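So AIC is the bias-corrected average log likelihood estimate, multiplied by $-2T$:
$$\mathrm{AIC} = -2T\left(\frac{1}{T}\, l(\hat\theta \mid y_{1:T}) - \frac{k}{T}\right) = -2\, l(\hat\theta \mid y_{1:T}) + 2k$$
where $l(\hat\theta \mid y_{1:T})$ is the maximized log likelihood. If the number of observations is small compared to the largest number of estimated parameters among the candidate models, we add a second-order bias correction term (the result is commonly written $\mathrm{AIC}_c$):
$$\mathrm{AIC}_c = -2\, l(\hat\theta \mid y_{1:T}) + 2k + \frac{2k(k+1)}{T - k - 1}$$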
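As a minimal sketch, both criteria can be computed directly from a maximized log likelihood. The function names `aic` and `aicc` and the numbers in the usage lines are illustrative assumptions, not taken from any particular library or data set.

```python
def aic(log_lik: float, k: int) -> float:
    """AIC = -2 * log likelihood + 2 * k."""
    return -2.0 * log_lik + 2.0 * k


def aicc(log_lik: float, k: int, T: int) -> float:
    """AIC plus the second-order correction 2k(k+1) / (T - k - 1)."""
    if T - k - 1 <= 0:
        raise ValueError("the correction needs T > k + 1")
    return aic(log_lik, k) + 2.0 * k * (k + 1) / (T - k - 1)


# Hypothetical fit: log likelihood -100 with 3 parameters on 20 observations.
print(aic(-100.0, 3))       # 206.0
print(aicc(-100.0, 3, 20))  # 207.5
```

The correction vanishes as T grows with k fixed, so the two criteria agree in large samples.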
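AIC is asymptotically efficient: if the true model is not in the set of models under consideration, AIC will asymptotically select the model that minimizes the mean squared error of prediction. AIC is not consistent: even if the true model is among the candidates, AIC does not select it with probability approaching one as $T$ grows.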
Bayesian Information Criterion
BIC Summary
$$\mathrm{BIC} = -2 \log f(y_{1:T} \mid \hat\theta, M) + k \log T$$
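A minimal sketch of the formula in code. The function name `bic` and the two hypothetical fits are assumptions made for illustration; the natural logarithm is used for $\log T$.

```python
import math


def bic(log_lik: float, k: int, T: int) -> float:
    """BIC = -2 * maximized log likelihood + k * log(T)."""
    return -2.0 * log_lik + k * math.log(T)


# Two hypothetical models fitted to the same T = 100 observations.
small = bic(log_lik=-100.0, k=3, T=100)
large = bic(log_lik=-98.5, k=5, T=100)
print("prefer the smaller model" if small < large else "prefer the larger model")
```

Here the larger model improves the log likelihood by 1.5, which is not enough to pay for two extra parameters at a cost of $\log 100 \approx 4.6$ each, so the smaller model has the lower BIC.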
BIC Derivation
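BIC aims to select the model with the highest posterior probability. Let $M_i = \{ f(Y \mid \theta_i) \mid \theta_i \in \Theta_i \}$ be a candidate model family. We focus on its posterior probability:
$$P(M_i \mid y_{1:T}) = \frac{p(y_{1:T} \mid M_i)\, P(M_i)}{\sum_j p(y_{1:T} \mid M_j)\, P(M_j)}$$
Now if we have no prior information, we can assume that all models have the same prior probability. So the key term is the marginal (integrated) likelihood:
$$p(y_{1:T} \mid M_i) = \int f(y_{1:T} \mid \theta_i, M_i)\, p(\theta_i \mid M_i)\, d\theta_i$$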
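The following notes are taken from Adrian E. Raftery's 1995 paper on Bayesian model selection (Raftery, 1995). They concern the Laplace approximation of the marginal likelihood of a single model $M$, expanded around the posterior mode $\theta^*$. Let $g(\theta) = \log\left[ f(y_{1:T} \mid \theta, M)\, p(\theta \mid M) \right]$, so that $\theta^*$ maximizes $g$. Then
$$\log p(y_{1:T} \mid M) = \log f(y_{1:T} \mid \theta^*, M) + \log p(\theta^* \mid M) + \frac{d}{2}\log(2\pi) - \frac{1}{2}\log\left|-g''(\theta^*)\right| + O(T^{-1})$$
In large samples, $\theta^*$ is approximately equal to the MLE $\hat\theta$, and $-g''(\theta^*) \approx T I$, where $I$ is the expected Fisher information matrix for one observation (the expectation being taken over the data, with the parameter held fixed). So we have $\left|-g''(\theta^*)\right| \approx |T I| = T^d |I|$. These two approximations introduce an $O(T^{-1/2})$ error term:
$$\log p(y_{1:T} \mid M) = \log f(y_{1:T} \mid \hat\theta, M) + \log p(\hat\theta \mid M) + \frac{d}{2}\log(2\pi) - \frac{d}{2}\log T - \frac{1}{2}\log|I| + O(T^{-1/2})$$
The last step is to drop all terms that do not grow with the number of observations, which leaves an $O(1)$ error:
$$\log p(y_{1:T} \mid M) = \log f(y_{1:T} \mid \hat\theta, M) - \frac{d}{2}\log T + O(1)$$
where $d$ is the number of free parameters.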
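To see that the error of this approximation stays bounded as T grows, it can be compared with an exact marginal likelihood in a conjugate toy model. The sketch below assumes $y_t \sim N(\theta, 1)$ with prior $\theta \sim N(0, \tau^2)$, so $d = 1$ and the marginal likelihood is available in closed form. The model, the data-generating mean of 0.5, and the function names are all assumptions chosen for illustration.

```python
import math
import random


def exact_log_marginal(y, tau2=1.0):
    """Exact log p(y) for y_t ~ N(theta, 1) with prior theta ~ N(0, tau2).

    Marginally y ~ N(0, I + tau2 * 11'), whose determinant and inverse
    are available in closed form.
    """
    T = len(y)
    s1 = sum(y)
    s2 = sum(v * v for v in y)
    return (-0.5 * T * math.log(2 * math.pi)
            - 0.5 * math.log(1.0 + T * tau2)
            - 0.5 * (s2 - tau2 * s1 ** 2 / (1.0 + T * tau2)))


def bic_log_marginal(y):
    """log f(y | theta_hat) - (d / 2) * log(T) with d = 1 free parameter."""
    T = len(y)
    mean = sum(y) / T  # MLE of theta
    rss = sum((v - mean) ** 2 for v in y)
    log_lik = -0.5 * T * math.log(2 * math.pi) - 0.5 * rss
    return log_lik - 0.5 * math.log(T)


rng = random.Random(1)
for T in (10, 100, 1000, 10000):
    y = [rng.gauss(0.5, 1.0) for _ in range(T)]
    gap = exact_log_marginal(y) - bic_log_marginal(y)
    print(T, round(gap, 3))
```

Both log marginal likelihoods grow in magnitude with T, while the printed gap should settle near a constant rather than grow, which is what an $O(1)$ error means here.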
The convention is to multiply the above estimate by minus two (writing $k$ for the number of free parameters $d$), hence
$$\mathrm{BIC} = -2 \log f(y_{1:T} \mid \hat\theta, M) + k \log T$$
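We can obtain approximate posterior model probabilities:
$$P(M_i \mid y_{1:T}) \approx \frac{\exp(-\mathrm{BIC}_i / 2)\, P(M_i)}{\sum_j \exp(-\mathrm{BIC}_j / 2)\, P(M_j)}$$
If we assume all models have the same prior probability, the model priors cancel. The resulting relative model probabilities are called BIC weights:
$$w_i = \frac{\exp(-\mathrm{BIC}_i / 2)}{\sum_j \exp(-\mathrm{BIC}_j / 2)}$$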
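BIC weights can be sketched in a few lines, assuming equal model priors so that each weight is proportional to $\exp(-\mathrm{BIC}_i / 2)$. The function name `bic_weights` and the three BIC values are hypothetical. Subtracting the smallest BIC before exponentiating is a numerical-stability choice and does not change the weights.

```python
import math


def bic_weights(bics):
    """Relative model probabilities proportional to exp(-BIC_i / 2)."""
    best = min(bics)
    # Shift by the best BIC so the largest exponent is exactly 0.
    rel = [math.exp(-0.5 * (b - best)) for b in bics]
    total = sum(rel)
    return [r / total for r in rel]


# Hypothetical BIC values for three candidate models.
weights = bic_weights([210.0, 212.0, 220.0])
print([round(w, 3) for w in weights])
```

Only BIC differences matter: a model 2 points worse than the best gets $\exp(-1) \approx 0.37$ times its weight, and a model 10 points worse gets $\exp(-5) \approx 0.007$ times its weight.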
Notice that in the previous estimate we ignored the term $\log p(\theta^* \mid M)$. We can further justify the criterion by assuming a unit-information prior, i.e., assuming the prior pdf is a multivariate Normal distribution with mean $\hat\theta$ and covariance matrix $I(\hat\theta)^{-1}$. The result ends up being the same approximation, with the error reduced from $O(1)$ to $O(T^{-1/2})$.
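A short worked version of why the unit-information prior gives back the same approximation. It assumes the large-sample Laplace expansion from Raftery (1995), in which the terms that do not grow with $T$ are $\log p(\hat\theta \mid M) + \frac{d}{2}\log(2\pi) - \frac{1}{2}\log|I|$, with $d$ the number of free parameters and $I$ the Fisher information for one observation. The first line is the log density of a $N(\hat\theta, I^{-1})$ distribution evaluated at its own mean.

```latex
\begin{align*}
\log p(\hat\theta \mid M)
  &= \log\left[(2\pi)^{-d/2}\,|I|^{1/2}\right]
   = -\frac{d}{2}\log(2\pi) + \frac{1}{2}\log|I| \\
\log p(y_{1:T} \mid M)
  &\approx \log f(y_{1:T} \mid \hat\theta, M)
   + \underbrace{\log p(\hat\theta \mid M) + \frac{d}{2}\log(2\pi)
   - \frac{1}{2}\log|I|}_{=\,0}
   - \frac{d}{2}\log T \\
  &= \log f(y_{1:T} \mid \hat\theta, M) - \frac{d}{2}\log T
\end{align*}
```

The constant terms cancel exactly under this prior, so nothing of order one has to be discarded.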
BIC is asymptotically consistent, meaning that it will asymptotically choose the true model (if the true model is under consideration) as the number of observations goes up. BIC is not asymptotically efficient.
Citation
Raftery, A. E. (1995). Bayesian Model Selection in Social Research. Sociological Methodology, 25, 111–163. https://doi.org/10.2307/271063
Visser, I., & Speekenbrink, M. (2022). Mixture and Hidden Markov Models with R. Springer Nature.