Bayesian Information Criterion Formula Proof

In summary, the BIC of an ARIMA model (or of any model fit by maximum likelihood) is given by ##k \log(n) - 2 \log(L)##, where ##L## is the maximized value of the likelihood function, ##k## is the number of parameters, and ##n## is the number of observations. The proof approximates the marginal-likelihood integral by a Taylor expansion around its maximum and diagonalizes the matrix of second derivatives. This reduces the integral to one Gaussian integral per variable, leading to the final BIC equation.
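
As a quick numerical sanity check of that formula, here is a minimal sketch in Python (assuming the statsmodels library; how a library counts parameters and effective observations can vary, so small discrepancies are possible):

```python
# Sanity-check BIC = k*log(n) - 2*log(L) against a library's reported value.
# Minimal sketch assuming statsmodels; parameter-counting conventions vary
# between libraries, so small discrepancies are possible.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
y = rng.standard_normal(200).cumsum()     # a simple random-walk series

res = ARIMA(y, order=(1, 1, 1)).fit()     # fit ARIMA(1,1,1) by maximum likelihood
k = len(res.params)                       # number of estimated parameters
n = res.nobs                              # number of observations
bic_manual = k * np.log(n) - 2 * res.llf  # k*log(n) - 2*log(L)

print(bic_manual, res.bic)                # should agree up to conventions
```
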
  • #1
mertcan
Hi everyone, while I was digging into the ARIMA model I saw that the BIC value is given as ##k \log(n) - 2 \log(L)##, where ##L## is the maximized value of the likelihood function and ##k## is the number of parameters. I have found a proof of the AIC, but no clue about this one. I wonder how it is derived. Could you help me with the proof?

Regards;
 
  • #2
Here's the way that I sort of understand it. We have the definition:

##P(\overrightarrow{y} | M) = \int P(\overrightarrow{y} | \overrightarrow{\theta}, M) P(\overrightarrow{\theta} | M) d \overrightarrow{\theta}##

##\overrightarrow{y}## is the vector of observations, ##M## is the model, and ##\overrightarrow{\theta}## is the vector of parameters in the model. Now, let ##Q(\overrightarrow{\theta})## be defined by:

##Q(\overrightarrow{\theta}) = \log\left(P(\overrightarrow{y} | \overrightarrow{\theta}, M) P(\overrightarrow{\theta} | M)\right)##

Then we are trying to approximate the integral:

##\int \exp(Q(\overrightarrow{\theta})) d \overrightarrow{\theta}##

What we assume is that ##Q## has a maximum at some particular value of the vector ##\overrightarrow{\theta}##, call it ##\overrightarrow{\Theta}##, and that it declines rapidly as you move away from that maximum. Under that assumption, you can approximate ##Q## by a Taylor expansion around its maximum:

##Q(\overrightarrow{\theta}) \approx Q(\overrightarrow{\Theta}) + (\overrightarrow{\theta} - \overrightarrow{\Theta}) \cdot \nabla_{\overrightarrow{\theta}} Q \big|_{\overrightarrow{\theta} = \overrightarrow{\Theta}} + \frac{1}{2} (\overrightarrow{\theta} - \overrightarrow{\Theta})^T H (\overrightarrow{\theta} - \overrightarrow{\Theta}) ##

where ##H## is a matrix of the second derivatives of ##Q##:

##\frac{1}{2} (\overrightarrow{\theta} - \overrightarrow{\Theta})^T H (\overrightarrow{\theta} - \overrightarrow{\Theta}) = \frac{1}{2} \sum_{ij} (\theta^i - \Theta^i) H_{ij} (\theta^j - \Theta^j)##

where
$$H_{ij} = \left.\frac{\partial^2 Q}{\partial \theta^i \partial \theta^j}\right|_{\overrightarrow{\theta} = \overrightarrow{\Theta}}$$

Since ##\overrightarrow{\Theta}## is the maximum of ##Q##, the gradient vanishes there, so the linear term drops out and we have:

##Q(\overrightarrow{\theta}) \approx Q(\overrightarrow{\Theta}) + \frac{1}{2} (\overrightarrow{\theta} - \overrightarrow{\Theta})^T H (\overrightarrow{\theta} - \overrightarrow{\Theta}) ##

So the integral becomes:

##\int \exp(Q(\overrightarrow{\theta})) d\overrightarrow{\theta} \approx \exp(Q(\overrightarrow{\Theta})) \int \exp(\frac{1}{2} (\overrightarrow{\theta} - \overrightarrow{\Theta})^T H (\overrightarrow{\theta} - \overrightarrow{\Theta})) d \overrightarrow{\theta}##
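
As a numerical aside (not part of the original posts), here is a one-dimensional illustration of this Laplace-type approximation; ##Q## below is an arbitrary sharply peaked function chosen just for the demonstration:

```python
# Check the Laplace approximation in one dimension:
#   ∫ exp(Q(θ)) dθ ≈ exp(Q(Θ)) * sqrt(2π / |Q''(Θ)|),  Θ = argmax Q.
# Minimal sketch; Q is an arbitrary sharply peaked function.
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize_scalar

def Q(theta):
    return -50.0 * (theta - 0.3) ** 2 - theta ** 4   # peaked near θ = 0.3

Theta = minimize_scalar(lambda t: -Q(t)).x           # locate the maximum Θ

h = 1e-4                                             # finite-difference step
H = (Q(Theta + h) - 2 * Q(Theta) + Q(Theta - h)) / h**2   # Q''(Θ) < 0

laplace = np.exp(Q(Theta)) * np.sqrt(2 * np.pi / abs(H))
exact, _ = quad(lambda t: np.exp(Q(t)), -5, 5)       # finite limits suffice here
print(laplace, exact)   # close agreement when the peak is sharp
```
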
 
  • #3
Continuing: the next step is to diagonalize ##H##. Since ##H## is symmetric, there is an orthogonal matrix ##U## such that ##U^T H U## is diagonal, with the eigenvalues ##\lambda_j## of ##H## along the diagonal. So we can change variables: ##X^i \equiv \sum_j U_{ji} (\theta^j - \Theta^j)##. Then the integral above becomes:

##\int \exp(\frac{1}{2} (\overrightarrow{\theta} - \overrightarrow{\Theta})^T H (\overrightarrow{\theta} - \overrightarrow{\Theta})) d \overrightarrow{\theta} = \int \exp(\frac{1}{2} \sum_j \lambda_j (X^j)^2) d \overrightarrow{X}##

(Because ##U## is orthogonal, the Jacobian of the change of variables is ##|\det U| = 1##.) That integral is easily calculated (all the ##\lambda_j## are negative, since ##\overrightarrow{\Theta}## is a maximum of ##Q##):

##\int \exp(\frac{1}{2} \sum_j \lambda_j (X^j)^2) d \overrightarrow{X} = \sqrt{\frac{(2 \pi)^k}{|\det H|}}##

There is one Gaussian integral for each variable ##X^j##, and since ##\prod_j |\lambda_j| = |\det H|##, the determinant in the result above follows.
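
To make this concrete, here is a small numerical check of that Gaussian-integral formula (a sketch for ##k = 2##, with a randomly generated negative-definite ##H##):

```python
# Verify  ∫ exp(½ xᵀ H x) dx = sqrt((2π)^k / |det H|)  for negative-definite H.
# Minimal sketch for k = 2, comparing the closed form to numerical quadrature.
import numpy as np
from scipy.integrate import dblquad

rng = np.random.default_rng(1)
A = rng.standard_normal((2, 2))
H = -(A @ A.T + np.eye(2))        # symmetric, eigenvalues <= -1

closed_form = np.sqrt((2 * np.pi) ** 2 / abs(np.linalg.det(H)))

def integrand(y, x):              # dblquad passes the inner variable first
    v = np.array([x, y])
    return np.exp(0.5 * v @ H @ v)

numeric, _ = dblquad(integrand, -10, 10, -10, 10)
print(closed_form, numeric)       # should agree closely
```
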
 
  • #4
To go further, look at this: http://www.math.utah.edu/~hbhat/BICderivation.pdf
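
For completeness, a rough outline of the remaining step (spelled out properly in that note): putting the pieces together,

$$P(\overrightarrow{y} | M) \approx \exp(Q(\overrightarrow{\Theta})) \sqrt{\frac{(2\pi)^k}{|\det H|}}$$

To leading order ##\overrightarrow{\Theta}## coincides with the maximum-likelihood estimate, so ##P(\overrightarrow{y} | \overrightarrow{\Theta}, M) \approx L##, and taking logs gives

$$\log P(\overrightarrow{y} | M) \approx \log L + \log P(\overrightarrow{\Theta} | M) + \frac{k}{2} \log 2\pi - \frac{1}{2} \log |\det H|$$

Because ##Q## sums contributions from ##n## observations, ##H## grows like ##n## times a bounded matrix, so ##\log |\det H| = k \log n + O(1)##. Keeping only the terms that grow with ##n## and multiplying by ##-2## leaves ##-2 \log P(\overrightarrow{y} | M) \approx k \log(n) - 2 \log(L)##, which is the BIC.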
 
  • #5
Thanks, I got it.
 

1. What is the Bayesian Information Criterion (BIC)?

The Bayesian Information Criterion (BIC) is a statistical measure used to evaluate how well a statistical model fits a given dataset. In the spirit of Occam's razor, which favors simpler models over more complex ones, it penalizes models for having more parameters.

2. How is the BIC formula derived?

The BIC is derived as a large-sample (Laplace) approximation to the logarithm of a model's Bayesian marginal likelihood, as sketched in the thread above. Although its form resembles that of the Akaike Information Criterion (AIC), the two have different origins: the BIC's ##k \log(n)## penalty comes out of the Bayesian approximation, while the AIC's ##2k## penalty comes from an information-theoretic argument.

3. What is the significance of the BIC value?

The BIC value can be used to compare different models and determine which one is the best fit for a given dataset. A lower BIC value indicates a better trade-off between goodness of fit and complexity, meaning the model explains the data well with relatively few parameters. However, the BIC value should not be used in isolation; it should be considered alongside other measures and the specific context of the problem.
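
For instance, here is a minimal sketch of BIC-based model comparison in Python (assuming statsmodels; the candidate ARIMA orders are arbitrary):

```python
# Compare candidate models by BIC and pick the one with the lowest value.
# Minimal sketch assuming statsmodels; candidate orders are arbitrary.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(2)
y = rng.standard_normal(300).cumsum()    # toy series for illustration

candidates = [(0, 1, 0), (1, 1, 0), (0, 1, 1), (1, 1, 1)]
fits = {order: ARIMA(y, order=order).fit() for order in candidates}

for order, res in fits.items():
    print(order, round(res.bic, 1))

best = min(fits, key=lambda order: fits[order].bic)  # lower BIC preferred
print("selected:", best)
```
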

4. Can the BIC be used for any type of model?

The BIC can be used with any type of model, as long as it is fitted using maximum likelihood estimation. This includes linear regression, logistic regression, ARIMA, and many other commonly used statistical models. However, it is not directly applicable to models that are not fitted by maximum likelihood; for fully Bayesian models, the marginal likelihood itself (or related criteria) is typically used for comparison instead.

5. Are there any limitations to using the BIC?

Like any statistical measure, the BIC has its limitations. It assumes that the true model is included in the set of candidate models being compared, which may not always be the case. Additionally, the approximation behind it assumes a sample size that is large relative to the number of parameters, so it can be unreliable for small samples. Therefore, it should be used in conjunction with other measures and careful consideration of the specific problem at hand.
