Gradient descent, hessian(E(W^T X)) = cov(X): why must the mean be 0?

  • Context: Undergrad
  • Thread starter: NotASmurf
  • Tags: Gradient
Discussion Overview

The discussion revolves around the implications of using a zero mean for inputs in the context of gradient descent and backpropagation algorithms in artificial neural networks (ANNs). Participants explore the relationship between the covariance matrix of inputs and the Hessian, as well as the significance of the mean in these calculations.

Discussion Character

  • Technical explanation
  • Conceptual clarification
  • Debate/contested

Main Points Raised

  • One participant notes that the error function in backpropagation is defined as E(X,W) = 0.5(target - W^T X)^2 and questions the necessity of a zero mean for inputs, suggesting that shifting the mean should not affect the covariance matrix.
  • Another participant suggests that the author may be performing "whitening," which involves normalizing data to have a mean of zero and a covariance matrix equal to the identity, commonly used in image processing.
  • A participant raises concerns about the implications of using a covariance matrix equal to the identity, questioning the uniqueness of eigenvalues and the logic behind using lambda*I - cov(X) instead of cov(X) inverse.
  • There is mention of a "standard Fresnel representation for the determinant of a symmetric matrix," with one participant expressing confusion about its definition and relevance to Gaussian multivariate optimization.
  • Another participant expresses gratitude for the mention of the Fresnel integral, indicating it clarifies their understanding of eigenvalue distributions derived from random matrices.

Areas of Agreement / Disagreement

Participants express differing views on the significance of a zero mean for inputs and the implications for the covariance matrix and eigenvalues. The discussion remains unresolved regarding the specific mathematical relationships and interpretations presented in the paper.

Contextual Notes

There are unresolved questions about the mathematical steps involved, particularly concerning the use of determinants and eigenvalues in the context of the covariance matrix and the implications of the mean being zero.

NotASmurf:
Backpropagation with E(X,W) = 0.5(target - W^T X)^2 as the error function: the paper I'm reading notes that the covariance matrix of the inputs is equal to the Hessian, and it uses that to develop its weight-update rule V(k+1) = V(k) + D*V(k), a slightly modified (not relevant to my question) version of normal feedforward backpropagation gradient descent. But he uses a mean of zero for the inputs, just like in all other neural networks. Doesn't shifting the mean leave the covariance matrix unchanged, so that the eigenvectors, the entropy H = |cov(X)|, and the second derivative shouldn't change? It's not even a conjugate prior; it's not like I'm encoding some terrible prior beliefs if we view it as a distribution mapper. Why does the mean being zero matter here, and ostensibly in all ANNs? Any help appreciated.
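One way to see why the mean can matter (a numerical sketch, not taken from the paper): for E(X,W) = 0.5*(target - W^T x)^2 the Hessian with respect to W is x x^T, so averaged over the data it is the second-moment matrix E[x x^T]. That equals cov(X) only when the mean is zero, since E[x x^T] = cov(X) + mean*mean^T. A quick numpy check:

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample 2-D inputs with a deliberately nonzero mean.
X = rng.normal(loc=3.0, scale=1.0, size=(100_000, 2))

# For E(X, W) = 0.5*(target - W^T x)^2, the Hessian w.r.t. W is x x^T,
# so averaged over the data it is the second-moment matrix E[x x^T].
hessian = X.T @ X / len(X)

# cov(X) = E[x x^T] - mean*mean^T, so the two agree only if mean = 0.
cov = np.cov(X, rowvar=False, bias=True)
mean = X.mean(axis=0)

print(np.allclose(hessian, cov + np.outer(mean, mean), atol=1e-8))  # True
print(np.allclose(hessian, cov, atol=1e-2))  # False: the mean is nonzero
```

So "Hessian = covariance" is only exact after centering; with a nonzero mean the Hessian picks up the extra rank-one term mean*mean^T.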
 
Sounds like the author is performing "whitening". Given some data set, take the eigenbasis, divide by the eigenvalues, and you get a normalized data set. If the data is multivariate Gaussian, the data now has a mean of zero with a covariance matrix equal to the identity. It's a pretty common practice in image processing, unless there's a lot of white noise.

edit: Actually misread the post, ignore me :)
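For concreteness, a minimal sketch of the whitening recipe described above (note that the standard version divides by the square roots of the eigenvalues, and centers the data first to get the zero mean):

```python
import numpy as np

rng = np.random.default_rng(1)

# Correlated 2-D Gaussian data with a nonzero mean.
A = np.array([[2.0, 0.5], [0.5, 1.0]])
X = rng.normal(size=(50_000, 2)) @ A + np.array([5.0, -3.0])

# Whitening: center, rotate into the eigenbasis of the covariance,
# then rescale each axis by 1/sqrt(eigenvalue).
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
X_white = Xc @ eigvecs / np.sqrt(eigvals)

print(np.allclose(X_white.mean(axis=0), 0, atol=1e-10))                  # True
print(np.allclose(np.cov(X_white, rowvar=False), np.eye(2), atol=1e-10)) # True
```

After this transform the sample mean is zero and the sample covariance is exactly the identity, whatever the original mean and covariance were.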
 
First, thanks for responding.
"If the data is multivariate Gaussian" it is a multivariate gaussian normal yes.

"data now as a mean of zero with a covariance matrix equal to the identity." but he also performs a gaussian integral on det(lambda*I - cov(X)), if cov(X) = I then it would have no eigenvalues, which need to be unique for this algorithm to work, and B) lambda*I - cov(X) is used where cov(X) inverse should be used, i know that A-lambda*I has no inverse, I don't know the logic behind him using lambda*I-A,

Also, he says he's using something called a "standard Fresnel representation for the determinant of a symmetric matrix R", but searching for "Fresnel representation determinant" doesn't turn up anything that looks "standard" at all; to be fair, the paper is over 25 years old. Do you have any idea what that is? Because he seems to just be doing some Gaussian multivariate optimization.
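One possible reading of the lambda*I - cov(X) question (a guess, since the paper isn't quoted here): det(lambda*I - A) is just the characteristic polynomial of A, which vanishes exactly at the eigenvalues of A, i.e. exactly where lambda*I - A fails to be invertible. A quick check on a symmetric, covariance-like matrix:

```python
import numpy as np

rng = np.random.default_rng(2)

# A symmetric positive semi-definite ("covariance-like") matrix.
B = rng.normal(size=(3, 3))
A = B @ B.T

# det(lambda*I - A) is the characteristic polynomial of A:
# it is (numerically) zero at each eigenvalue of A, which is
# precisely where lambda*I - A has no inverse.
eigvals = np.linalg.eigvalsh(A)
for lam in eigvals:
    print(abs(np.linalg.det(lam * np.eye(3) - A)) < 1e-6)  # True for each
```

So an author integrating against det(lambda*I - cov(X)) may be working with the eigenvalue spectrum of cov(X) rather than with its inverse.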
 
Without reading the paper, it's hard to comment. However, there is a relationship between the Fresnel integral and the multivariate Gaussian, though I'm not well versed enough to say anything meaningful about it. I simply recall that in grad school a friend of mine researched random matrices, and his thesis was on exactly that relationship.
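For reference, the identity the paper likely means (an assumption, since the paper isn't quoted here) is the standard Gaussian-integral representation of the determinant of a symmetric positive-definite matrix R, whose oscillatory ("Fresnel") version replaces the real exponent with an imaginary one:

```latex
% Gaussian integral over R^n, for symmetric positive-definite R:
\int_{\mathbb{R}^n} e^{-\frac{1}{2} x^{\mathsf{T}} R x}\, dx
  = \frac{(2\pi)^{n/2}}{\sqrt{\det R}}

% Fresnel (oscillatory) version, for symmetric invertible R,
% where \sigma(R) is the signature of R:
\int_{\mathbb{R}^n} e^{\frac{i}{2} x^{\mathsf{T}} R x}\, dx
  = \frac{(2\pi)^{n/2}\, e^{i\pi\sigma(R)/4}}{\sqrt{|\det R|}}
```

Either form lets one trade a determinant for a Gaussian/Fresnel integral, which is presumably what the paper exploits.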
 
"Fresnel integral" Thank you! that's just what i was looking for, the distrubutions derived for the eigenvalues from random matrices looks less arbitrary now. Last thing,

"data now as a mean of zero with a covariance matrix equal to the identity." are you saying that if mean is zero cov(X) is a diagonal with all dimensions having same magnitude?
 
