Bayesian Information Criterion Formula Proof


Discussion Overview

The discussion revolves around the derivation of the Bayesian Information Criterion (BIC) formula, specifically the expression $k \cdot \log(n) - 2 \cdot \log(L)$, where $L$ represents the maximized likelihood function and $k$ denotes the number of parameters. Participants explore theoretical aspects and mathematical reasoning related to this derivation.

Discussion Character

  • Technical explanation
  • Mathematical reasoning

Main Points Raised

  • One participant introduces the BIC formula and requests assistance with its proof.
  • Another participant provides a detailed explanation involving the definition of the likelihood and the use of a Taylor expansion to approximate the integral related to the likelihood function.
  • Further elaboration includes the diagonalization of the Hessian matrix and the evaluation of the Gaussian integral, with assumptions about the nature of the second derivatives.
  • A link to an external resource is shared for additional reference on the derivation.
  • A later reply indicates that the initial query has been resolved, suggesting understanding has been achieved.

Areas of Agreement / Disagreement

While one participant expresses satisfaction with the explanation provided, the discussion does not indicate a formal consensus on all aspects of the derivation, as it primarily consists of individual contributions and clarifications.

Contextual Notes

The discussion includes assumptions about the behavior of the likelihood function and the properties of the Hessian matrix, which may not be universally applicable without further context.

mertcan
Hi everyone, while I was digging into the ARIMA model I saw that the BIC value is given as $k \cdot \log(n) - 2 \cdot \log(L)$, where $L$ is the maximized value of the likelihood function and $k$ is the number of parameters. I have found a proof of the AIC but no clue about this one. I wonder how it is derived. Could you help me with the proof?

Regards;
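As a concrete illustration of the formula being asked about, here is a minimal sketch that evaluates ##k \cdot \log(n) - 2 \cdot \log(L)## for a toy Gaussian model fitted by maximum likelihood (the data values and the `bic` helper are hypothetical, just for demonstration):

```python
import math

def bic(k, n, log_likelihood):
    """BIC = k*log(n) - 2*log(L); lower values are preferred."""
    return k * math.log(n) - 2 * log_likelihood

# Toy data: fit a normal model by maximum likelihood (k = 2 parameters)
data = [2.1, 1.9, 2.3, 2.0, 1.8, 2.2]
n = len(data)
mu = sum(data) / n                           # MLE of the mean
var = sum((x - mu) ** 2 for x in data) / n   # MLE of the variance

# Log-likelihood of the data at the MLE
log_L = sum(-0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
            for x in data)

print(bic(2, n, log_L))
```

The ##k \cdot \log(n)## term penalizes extra parameters, so a model with more parameters must improve ##\log L## enough to overcome the penalty.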
 
Here's the way that I sort of understand it. We have the definition:

##P(\overrightarrow{y} | M) = \int P(\overrightarrow{y} | \overrightarrow{\theta}, M) P(\overrightarrow{\theta} | M) d \overrightarrow{\theta}##

##\overrightarrow{y}## is the vector of observations, ##M## is the model, and ##\overrightarrow{\theta}## is the vector of parameters in the model. Now, let ##Q(\overrightarrow{\theta})## be defined by:

##Q(\overrightarrow{\theta}) = \log(P(\overrightarrow{y} | \overrightarrow{\theta}, M) P(\overrightarrow{\theta} | M))##

Then we are trying to approximate the integral:

##\int \exp(Q(\overrightarrow{\theta})) d \overrightarrow{\theta}##

What we assume is that ##Q## has a maximum at some particular value of the vector ##\overrightarrow{\theta}##, call it ##\overrightarrow{\Theta}##, and that it declines rapidly as you move away from that maximum. Under that assumption, you can approximate ##Q## by a Taylor expansion around its maximum:

##Q(\overrightarrow{\theta}) \approx Q(\overrightarrow{\Theta}) + (\overrightarrow{\theta} - \overrightarrow{\Theta}) \cdot \nabla_{\overrightarrow{\theta}} Q |_{\overrightarrow{\theta} = \overrightarrow{\Theta}} + \frac{1}{2} (\overrightarrow{\theta} - \overrightarrow{\Theta})^T H (\overrightarrow{\theta} - \overrightarrow{\Theta}) ##

where ##H## is a matrix of the second derivatives of ##Q##:

##\frac{1}{2} (\overrightarrow{\theta} - \overrightarrow{\Theta})^T H (\overrightarrow{\theta} - \overrightarrow{\Theta}) = \frac{1}{2} \sum_{ij} (\theta^i - \Theta^i) H_{ij} (\theta^j - \Theta^j)##

where
$$H_{ij} = \frac{\partial^2 Q}{\partial \theta^i \partial \theta^j}|_{\overrightarrow{\theta} = \overrightarrow{\Theta}}$$

Since ##\overrightarrow{\Theta}## is the maximum of ##Q##, the gradient vanishes there and the linear term drops out. So we have:

##Q(\overrightarrow{\theta}) \approx Q(\overrightarrow{\Theta}) + \frac{1}{2} (\overrightarrow{\theta} - \overrightarrow{\Theta})^T H (\overrightarrow{\theta} - \overrightarrow{\Theta}) ##

So the integral becomes:

##\int \exp(Q(\overrightarrow{\theta})) d\overrightarrow{\theta} \approx \exp(Q(\overrightarrow{\Theta})) \int \exp(\frac{1}{2} (\overrightarrow{\theta} - \overrightarrow{\Theta})^T H (\overrightarrow{\theta} - \overrightarrow{\Theta})) d \overrightarrow{\theta}##
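This step is the Laplace approximation. A minimal one-dimensional sketch, assuming a hypothetical sharply peaked log-integrand ##Q## (standing in for ##\log(P(\overrightarrow{y} | \overrightarrow{\theta}, M) P(\overrightarrow{\theta} | M))##, with peak at `m` and curvature `-a`), compares the numerical integral against the closed form:

```python
import math

def Q(theta, m=1.0, a=200.0):
    # Hypothetical log-integrand: sharply peaked at theta = m,
    # with second derivative Q''(m) = -a.
    return -0.5 * a * (theta - m) ** 2

def midpoint_integral(f, lo, hi, steps=200_000):
    # Simple midpoint-rule quadrature
    h = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * h) for i in range(steps)) * h

numeric = midpoint_integral(lambda t: math.exp(Q(t)), 0.0, 2.0)

# Laplace approximation: exp(Q(m)) * sqrt(2*pi / |Q''(m)|)
laplace = math.exp(Q(1.0)) * math.sqrt(2 * math.pi / 200.0)

print(numeric, laplace)  # the two values agree closely
```

The sharper the peak (larger `a`, which in the BIC setting grows with the sample size ##n##), the better the quadratic approximation captures the full integral.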
 
So the next step is to diagonalize ##H##. Since ##H## is symmetric, there is an orthogonal matrix ##U## such that ##U^T H U## is diagonal, and we can change variables: ##X^i \equiv \sum_j U^{ij} (\theta^j - \Theta^j)##. Then the above integral becomes:

##\int \exp(\frac{1}{2} (\overrightarrow{\theta} - \overrightarrow{\Theta})^T H (\overrightarrow{\theta} - \overrightarrow{\Theta})) d \overrightarrow{\theta} = \int \exp(\frac{1}{2} \sum_j H_{jj} (X^j)^2) d \overrightarrow{X}##

where the ##H_{jj}## are the diagonal entries (the eigenvalues) of the diagonalized matrix. (The Jacobian of the change of variables is ##|\det(U)| = 1##, since ##U## is orthogonal.) That integral is easily calculated (if all the ##H_{jj}## are negative, which they will be at a maximum):

##\int \exp(\frac{1}{2} \sum_j H_{jj} (X^j)^2) d \overrightarrow{X} = \sqrt{\frac{(2 \pi)^k}{|\det(H)|}}##

There is one Gaussian integral for each variable ##X^j##.
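This factorization can be checked numerically. A small sketch, assuming hypothetical negative diagonal entries for the diagonalized Hessian, multiplies the ##k## one-dimensional Gaussian integrals and compares against ##\sqrt{(2\pi)^k / |\det(H)|}##:

```python
import math

def gauss_1d(h, lo=-10.0, hi=10.0, steps=100_000):
    """Numerically integrate exp(0.5*h*x^2) for h < 0 (midpoint rule)."""
    step = (hi - lo) / steps
    return sum(math.exp(0.5 * h * (lo + (i + 0.5) * step) ** 2)
               for i in range(steps)) * step

# Hypothetical diagonal entries of the diagonalized Hessian --
# all negative, as they are at a maximum of Q
h_diag = [-2.0, -5.0, -1.5]
k = len(h_diag)

numeric = 1.0
for h in h_diag:   # the k-dim integral factors into k 1-D Gaussians
    numeric *= gauss_1d(h)

# Closed form: sqrt((2*pi)^k / |det H|), det H = product of eigenvalues
closed_form = math.sqrt((2 * math.pi) ** k / abs(math.prod(h_diag)))

print(numeric, closed_form)
```

Since the determinant is basis-independent, the product of the eigenvalues equals ##\det(H)## in the original coordinates as well.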
 
To go further, look at this: http://www.math.utah.edu/~hbhat/BICderivation.pdf
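For completeness, the final step can be sketched as follows (under the usual assumptions that the prior term and the Taylor remainder contribute only ##O(1)##). Evaluating ##Q## at its maximum gives ##Q(\overrightarrow{\Theta}) \approx \log L##, where ##L## is the maximized likelihood, so

$$\log P(\overrightarrow{y} | M) \approx \log L + \frac{1}{2} \log \frac{(2 \pi)^k}{|\det(H)|}$$

Because ##Q## is a sum over the ##n## observations, each entry of ##H## grows like ##n##, so ##|\det(H)| \sim n^k## up to an ##n##-independent factor, and

$$\log P(\overrightarrow{y} | M) \approx \log L - \frac{k}{2} \log n + O(1)$$

Multiplying by ##-2## and dropping the ##O(1)## terms gives ##BIC = k \log n - 2 \log L##.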
 
Thanks, I got it.
 
