TL;DR Summary: Check my calculation

Greetings!

I believe I found an error in a paper on Bayesian neural networks. I think their expression for the covariance of the posterior predictive distribution is wrong, so I wrote down my own calculation. It would be great if a seasoned Bayesian could take a look.

Imagine a regression scenario. We want to learn a function ##f_{\theta}: \mathbb{R}^m \rightarrow \mathbb{R}^n## that fits the data set ##X=\{ ( x_i, y_i) \,|\, i=1,\dots,N \}## as well as possible, where ##x_i## is the model input and ##y_i## the target. The data points are i.i.d. The function is parametrized by ##\theta \in \mathbb{R}^d##. Given ##\theta##, we expect the target ##y## for a given input ##x## to be normally distributed around ##f_{\theta}(x)##. This gives the likelihood function

$$

p(y|x,\theta) = \mathcal{N}(f_{\theta}(x),\Sigma).

$$

Suppose now that we wrote down the posterior ##p(\theta| X)## and, by whatever means, we obtained samples from it,

$$

S=\{ \theta_i \,|\, \theta_i \sim p(\theta| X), i=1,\dots,M \}.

$$

The posterior predictive distribution (typically applied to an ##x## that was not seen before) is now given by

$$

p(y | x, X) = \int p(y|x, \theta) p(\theta | X) \text{d} \theta.

$$

The final prediction of our model can now be written as an average over the predictions given by the posterior samples, i.e.

$$

\begin{align}

\hat{y} & = \mathbb{E}_{y|x,X}(y) \\

& = \int y p(y | x, X) \text{d}y \\

& = \int y \int p(y|x, \theta) p(\theta | X) \text{d} \theta \text{d}y \\

& = \int \mathbb{E}_{y|x,\theta}(y) p(\theta | X) \text{d} \theta \\

& = \mathbb{E}_{\theta | X} \Big( \mathbb{E}_{y|x,\theta}(y) \Big) \\

& = \mathbb{E}_{\theta | X} \Big( f_{\theta}(x) \Big) \\

& \approx \frac{1}{|S|} \sum_{\theta_{i} \in S} f_{\theta_{i}}(x),

\end{align}

$$

where in the penultimate line we used the likelihood function from above, and the last line is the standard Monte Carlo estimator of the posterior predictive mean.
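As a quick sanity check, the Monte Carlo estimate of ##\hat{y}## can be sketched in a few lines. The model here is a made-up 1-D toy (##f_{\theta}(x) = \theta x## with a Gaussian posterior over ##\theta##), not anything from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup (not from the paper): f_theta(x) = theta * x,
# with posterior samples theta_i ~ N(1.0, 0.2^2) standing in for S.
x = 2.0
theta_samples = rng.normal(1.0, 0.2, size=100_000)

# Monte Carlo estimate of the posterior predictive mean:
# y_hat ~= (1/|S|) * sum_i f_{theta_i}(x)
y_hat = (theta_samples * x).mean()
print(y_hat)  # close to the exact value E[theta] * x = 2.0
```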

To quantify the uncertainty of our prediction ##\hat{y}##, one can use the covariance matrix ##\Sigma_{y|x,X}##. In the paper, the authors give the formula (without calculation)

$$

\Sigma_{y|x,X} \approx \frac{1}{|S|-1} \sum_{\theta_{i} \in S} (f_{\theta_{i}}(x) - \hat{y})(f_{\theta_{i}}(x) - \hat{y})^{T}.

$$

I think this is wrong. The covariance is

$$

\Sigma_{y|x,X}=\mathbb{E}_{y|x,X}\Big((y-\hat{y})(y-\hat{y})^{T}\Big).

$$

Repeating the computation of ##\mathbb{E}_{y|x,X}(y)## from the beginning, with ##(y-\hat{y})(y-\hat{y})^{T}## in place of ##y##, one arrives at

$$

\Sigma_{y|x,X} = \mathbb{E}_{\theta | X} \Bigg( \mathbb{E}_{y|x,\theta}\Big((y-\hat{y})(y-\hat{y})^{T}\Big) \Bigg).

$$

It looks like they simply set ##\mathbb{E}_{y|x,\theta}\Big((y-\hat{y})(y-\hat{y})^{T}\Big) = (\mathbb{E}_{y|x,\theta}(y)-\hat{y})(\mathbb{E}_{y|x,\theta}(y)-\hat{y})^{T}##, which would then reduce to their expression. But since the bracketed term is a product, pulling the expectation inside is not allowed in general, just as ##\big(\mathbb{E}(V)\big)^2\neq \mathbb{E}(V^2)## for a random variable ##V##.
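That gap between ##\mathbb{E}(V^2)## and ##\big(\mathbb{E}(V)\big)^2## is easy to see numerically; the difference is exactly the variance of ##V##. A minimal illustration with made-up standard normal draws:

```python
import numpy as np

rng = np.random.default_rng(0)
V = rng.standard_normal(200_000)  # V ~ N(0, 1), purely illustrative

second_moment = (V ** 2).mean()   # E[V^2]   -> close to 1
squared_mean = V.mean() ** 2      # (E[V])^2 -> close to 0
print(second_moment, squared_mean)
# The two differ by Var(V), which is 1 here.
```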

I tried to derive the correct expression on my own:

We use the other formula for the covariance,

$$

\begin{align}

\Sigma_{y|x,X} &= \mathbb{E}_{y|x,X} (yy^T) - \hat{y}\hat{y}^{T} \\

&= \mathbb{E}_{\theta | X} \Big( \mathbb{E}_{y|x,\theta}(yy^T) \Big) - \hat{y}\hat{y}^{T} \\

& \approx \frac{1}{|S|} \sum_{\theta_{i} \in S}\mathbb{E}_{y|x,\theta_{i}}(yy^T) - \hat{y}\hat{y}^{T} \\

&= \frac{1}{|S|} \sum_{\theta_{i} \in S}\Big( \Sigma + f_{\theta_{i}}(x)f_{\theta_{i}}(x)^{T}\Big) - \hat{y}\hat{y}^{T},

\end{align}

$$

where the last line uses the Gaussian likelihood ##p(y|x,\theta)## together with the identity ##\Sigma = \mathbb{E}_{y|x,\theta}(yy^{T})-\mathbb{E}_{y|x,\theta}(y) \mathbb{E}_{y|x,\theta}(y^T)##.
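As a numerical check of both claims, the following sketch compares the paper's formula, the corrected one, and the empirical variance of actual draws from the posterior predictive, on a made-up 1-D toy model (##f_{\theta}(x) = \theta x##, Gaussian posterior over ##\theta##; all numbers are illustrative assumptions, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy model: f_theta(x) = theta * x,
# posterior theta | X ~ N(1.0, 0.2^2), likelihood noise Sigma = 0.3^2.
x = 2.0
Sigma = 0.3 ** 2
M = 500_000

theta = rng.normal(1.0, 0.2, size=M)  # posterior samples S
f = theta * x                         # f_{theta_i}(x) per sample
y_hat = f.mean()                      # posterior predictive mean

# Paper's formula: sample covariance of the f_{theta_i}(x) alone.
var_paper = ((f - y_hat) ** 2).sum() / (M - 1)

# Corrected formula: average of (Sigma + f f^T) minus y_hat y_hat^T
# (the 1/M vs 1/(M-1) distinction is negligible at this sample size).
var_corrected = Sigma + (f ** 2).mean() - y_hat ** 2

# Ground truth: empirical variance of draws y ~ p(y | x, X).
y = rng.normal(f, np.sqrt(Sigma))
var_true = y.var(ddof=1)

print(var_paper, var_corrected, var_true)
# Exact value: 0.3^2 + 0.2^2 * x^2 = 0.25; var_paper misses the
# likelihood noise Sigma, while var_corrected matches var_true.
```

The paper's estimator only captures the spread of the ##f_{\theta_i}(x)## across posterior samples and drops the likelihood noise ##\Sigma##, which is exactly the extra term your derivation produces.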

Do you agree with all of this?

Cheers!

SL
