Bayesian Information Criterion Formula Proof

In summary, the BIC of an ARIMA model (or of any model fit by maximum likelihood) is given by ##k \log(n) - 2 \log(L)##, where ##L## is the maximized value of the likelihood function, ##k## is the number of parameters, and ##n## is the number of observations. The proof approximates the marginal-likelihood integral by a Taylor expansion around its maximum and diagonalizes the matrix of second derivatives. This reduces the integral to one Gaussian integral per variable, leading to the final BIC equation.
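
As a quick numerical sanity check of that formula, here is a minimal sketch in Python (assuming the statsmodels library; how a library counts parameters and effective observations can vary, so small discrepancies are possible):

```python
# Sanity-check BIC = k*log(n) - 2*log(L) against a library's reported value.
# Minimal sketch assuming statsmodels; parameter-counting conventions vary
# between libraries, so small discrepancies are possible.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
y = rng.standard_normal(200).cumsum()     # a simple random-walk series

res = ARIMA(y, order=(1, 1, 1)).fit()     # fit ARIMA(1,1,1) by maximum likelihood
k = len(res.params)                       # number of estimated parameters
n = res.nobs                              # number of observations
bic_manual = k * np.log(n) - 2 * res.llf  # k*log(n) - 2*log(L)

print(bic_manual, res.bic)                # should agree up to conventions
```
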
  • #1
mertcan
Hi everyone, while I was digging into the ARIMA model I saw that the BIC value is given as ##k \log(n) - 2 \log(L)##, where ##L## is the maximized value of the likelihood function and ##k## is the number of parameters. I have found a proof of the AIC, but no clue about this one. I wonder how it is derived. Could you help me with the proof?

Regards;
 
  • #2
Here's the way that I sort of understand it. We have the definition:

##P(\overrightarrow{y} | M) = \int P(\overrightarrow{y} | \overrightarrow{\theta}, M) P(\overrightarrow{\theta} | M) d \overrightarrow{\theta}##

##\overrightarrow{y}## is the vector of observations, ##M## is the model, and ##\overrightarrow{\theta}## is the vector of parameters in the model. Now, let ##Q(\overrightarrow{\theta})## be defined by:

##Q(\overrightarrow{\theta}) = \log\left(P(\overrightarrow{y} | \overrightarrow{\theta}, M) P(\overrightarrow{\theta} | M)\right)##

Then we are trying to approximate the integral:

##\int \exp(Q(\overrightarrow{\theta})) d \overrightarrow{\theta}##

What we assume is that ##Q## has a maximum at some particular value of the vector ##\overrightarrow{\theta}##, call it ##\overrightarrow{\Theta}##, and that it declines rapidly as you move away from that maximum. Under that assumption, you can approximate ##Q## by a Taylor expansion around its maximum:

##Q(\overrightarrow{\theta}) \approx Q(\overrightarrow{\Theta}) + (\overrightarrow{\theta} - \overrightarrow{\Theta}) \cdot \nabla_{\overrightarrow{\theta}} Q \big|_{\overrightarrow{\theta} = \overrightarrow{\Theta}} + \frac{1}{2} (\overrightarrow{\theta} - \overrightarrow{\Theta})^T H (\overrightarrow{\theta} - \overrightarrow{\Theta}) ##

where ##H## is a matrix of the second derivatives of ##Q##:

##\frac{1}{2} (\overrightarrow{\theta} - \overrightarrow{\Theta})^T H (\overrightarrow{\theta} - \overrightarrow{\Theta}) = \frac{1}{2} \sum_{ij} (\theta^i - \Theta^i) H_{ij} (\theta^j - \Theta^j)##

where
$$H_{ij} = \left.\frac{\partial^2 Q}{\partial \theta^i \partial \theta^j}\right|_{\overrightarrow{\theta} = \overrightarrow{\Theta}}$$

Since ##\overrightarrow{\Theta}## is the maximum of ##Q##, the gradient vanishes there, so the linear term drops out and we have:

##Q(\overrightarrow{\theta}) \approx Q(\overrightarrow{\Theta}) + \frac{1}{2} (\overrightarrow{\theta} - \overrightarrow{\Theta})^T H (\overrightarrow{\theta} - \overrightarrow{\Theta}) ##

So the integral becomes:

##\int \exp(Q(\overrightarrow{\theta})) d\overrightarrow{\theta} \approx \exp(Q(\overrightarrow{\Theta})) \int \exp(\frac{1}{2} (\overrightarrow{\theta} - \overrightarrow{\Theta})^T H (\overrightarrow{\theta} - \overrightarrow{\Theta})) d \overrightarrow{\theta}##
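
As a numerical aside (not part of the original posts), here is a one-dimensional illustration of this Laplace-type approximation; ##Q## below is an arbitrary sharply peaked function chosen just for the demonstration:

```python
# Check the Laplace approximation in one dimension:
#   ∫ exp(Q(θ)) dθ ≈ exp(Q(Θ)) * sqrt(2π / |Q''(Θ)|),  Θ = argmax Q.
# Minimal sketch; Q is an arbitrary sharply peaked function.
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize_scalar

def Q(theta):
    return -50.0 * (theta - 0.3) ** 2 - theta ** 4   # peaked near θ = 0.3

Theta = minimize_scalar(lambda t: -Q(t)).x           # locate the maximum Θ

h = 1e-4                                             # finite-difference step
H = (Q(Theta + h) - 2 * Q(Theta) + Q(Theta - h)) / h**2   # Q''(Θ) < 0

laplace = np.exp(Q(Theta)) * np.sqrt(2 * np.pi / abs(H))
exact, _ = quad(lambda t: np.exp(Q(t)), -5, 5)       # finite limits suffice here
print(laplace, exact)   # close agreement when the peak is sharp
```
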
 
  • #3
Continuing: the next step is to diagonalize ##H##. Since ##H## is symmetric, there is an orthogonal matrix ##U## such that ##U^T H U## is diagonal, with the eigenvalues ##\lambda_j## of ##H## along the diagonal. So we can change variables: ##X^i \equiv \sum_j U_{ji} (\theta^j - \Theta^j)##. Then the integral above becomes:

##\int \exp(\frac{1}{2} (\overrightarrow{\theta} - \overrightarrow{\Theta})^T H (\overrightarrow{\theta} - \overrightarrow{\Theta})) d \overrightarrow{\theta} = \int \exp(\frac{1}{2} \sum_j \lambda_j (X^j)^2) d \overrightarrow{X}##

(Because ##U## is orthogonal, the Jacobian of the change of variables is ##|\det U| = 1##.) That integral is easily calculated (all the ##\lambda_j## are negative, since ##\overrightarrow{\Theta}## is a maximum of ##Q##):

##\int \exp(\frac{1}{2} \sum_j \lambda_j (X^j)^2) d \overrightarrow{X} = \sqrt{\frac{(2 \pi)^k}{|\det H|}}##

There is one Gaussian integral for each variable ##X^j##, and since ##\prod_j |\lambda_j| = |\det H|##, the determinant in the result above follows.
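
To make this concrete, here is a small numerical check of that Gaussian-integral formula (a sketch for ##k = 2##, with a randomly generated negative-definite ##H##):

```python
# Verify  ∫ exp(½ xᵀ H x) dx = sqrt((2π)^k / |det H|)  for negative-definite H.
# Minimal sketch for k = 2, comparing the closed form to numerical quadrature.
import numpy as np
from scipy.integrate import dblquad

rng = np.random.default_rng(1)
A = rng.standard_normal((2, 2))
H = -(A @ A.T + np.eye(2))        # symmetric, eigenvalues <= -1

closed_form = np.sqrt((2 * np.pi) ** 2 / abs(np.linalg.det(H)))

def integrand(y, x):              # dblquad passes the inner variable first
    v = np.array([x, y])
    return np.exp(0.5 * v @ H @ v)

numeric, _ = dblquad(integrand, -10, 10, -10, 10)
print(closed_form, numeric)       # should agree closely
```
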
 
  • #4
To go further, look at this: http://www.math.utah.edu/~hbhat/BICderivation.pdf
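
For completeness, a rough outline of the remaining step (spelled out properly in that note): putting the pieces together,

$$P(\overrightarrow{y} | M) \approx \exp(Q(\overrightarrow{\Theta})) \sqrt{\frac{(2\pi)^k}{|\det H|}}$$

To leading order ##\overrightarrow{\Theta}## coincides with the maximum-likelihood estimate, so ##P(\overrightarrow{y} | \overrightarrow{\Theta}, M) \approx L##, and taking logs gives

$$\log P(\overrightarrow{y} | M) \approx \log L + \log P(\overrightarrow{\Theta} | M) + \frac{k}{2} \log 2\pi - \frac{1}{2} \log |\det H|$$

Because ##Q## sums contributions from ##n## observations, ##H## grows like ##n## times a bounded matrix, so ##\log |\det H| = k \log n + O(1)##. Keeping only the terms that grow with ##n## and multiplying by ##-2## leaves ##-2 \log P(\overrightarrow{y} | M) \approx k \log(n) - 2 \log(L)##, which is the BIC.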
 
  • #5
Thanks, I got it.
 

1. What is the Bayesian Information Criterion (BIC)?

The Bayesian Information Criterion (BIC) is a statistical measure used to evaluate how well a statistical model fits a given dataset. In the spirit of Occam's razor, which favors simpler models over more complex ones, it penalizes models for having more parameters.

2. How is the BIC formula derived?

The BIC is derived as a large-sample (Laplace) approximation to the logarithm of a model's Bayesian marginal likelihood, as sketched in the thread above. Although its form resembles that of the Akaike Information Criterion (AIC), the two have different origins: the BIC's ##k \log(n)## penalty comes out of the Bayesian approximation, while the AIC's ##2k## penalty comes from an information-theoretic argument.

3. What is the significance of the BIC value?

The BIC value can be used to compare different models and determine which one is the best fit for a given dataset. A lower BIC value indicates a better trade-off between goodness of fit and complexity, meaning the model explains the data well with relatively few parameters. However, the BIC value should not be used in isolation; it should be considered alongside other measures and the specific context of the problem.
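
For instance, here is a minimal sketch of BIC-based model comparison in Python (assuming statsmodels; the candidate ARIMA orders are arbitrary):

```python
# Compare candidate models by BIC and pick the one with the lowest value.
# Minimal sketch assuming statsmodels; candidate orders are arbitrary.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(2)
y = rng.standard_normal(300).cumsum()    # toy series for illustration

candidates = [(0, 1, 0), (1, 1, 0), (0, 1, 1), (1, 1, 1)]
fits = {order: ARIMA(y, order=order).fit() for order in candidates}

for order, res in fits.items():
    print(order, round(res.bic, 1))

best = min(fits, key=lambda order: fits[order].bic)  # lower BIC preferred
print("selected:", best)
```
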

4. Can the BIC be used for any type of model?

The BIC can be used with any type of model, as long as it is fitted using maximum likelihood estimation. This includes linear regression, logistic regression, ARIMA, and many other commonly used statistical models. However, it is not directly applicable to models that are not fitted by maximum likelihood; for fully Bayesian models, the marginal likelihood itself (or related criteria) is typically used for comparison instead.

5. Are there any limitations to using the BIC?

Like any statistical measure, the BIC has its limitations. It assumes that the true model is included in the set of candidate models being compared, which may not always be the case. Additionally, the approximation behind it assumes a sample size that is large relative to the number of parameters, so it can be unreliable for small samples. Therefore, it should be used in conjunction with other measures and careful consideration of the specific problem at hand.
