Why Standardize Data to Mean 0 & SD 1 in Multivariate Analysis?

In summary, when data are standardized in multivariate analysis, the standard deviation becomes 1 not as an arbitrary choice but as a consequence of the procedure: each centered data point is divided by the actual standard deviation, so the resulting z-scores necessarily have mean 0 and SD 1. Dividing by a larger quantity such as 3 sigma would merely shrink the z-scores without adding information, whereas dividing by sigma itself keeps z-scores interpretable as distances from the mean in standard-deviation units, and the same formula applies to any data point or linear combination of data points, including the sample average.
  • #1
Philip Wong
hi guys,
I'm wondering: in multivariate analysis, when we standardize our mean to 0 (center our data on the centroid), why do we have to set our S.D. = 1?

I mean, doesn't S.D. = 1 only cover about 68% of the data? Shouldn't we use S.D. = 3, where about 99.7% of the sampling data would be covered instead? When we center our data, we've already lost some information, so why do we still use S.D. = 1?

can someone explain it to me! thanks!
 
  • #2
You don't set sd = 1 because you standardize the mean.

The reason the sd comes out as 1 is that, as part of standardization, you divide each centered data point by the actual sd.
 
  • #3
EnumaElish said:
You don't set sd = 1 because you standardize the mean.

The reason the sd comes out as 1 is that, as part of standardization, you divide each centered data point by the actual sd.

can you explain a bit more what you mean?
I understand up to the point where we use s.d. = 1 because that gives us the actual variation of the sample from the population (i.e. it has something to do with the z-score). But does it have to be 1? Or could it be any number up to and including s.d. = 3?

Let's say we could use any s.d.: I'm aware that a higher s.d. gives us a smaller z-score, correct? Meaning there is less variation between the sample and the population, correct?

The reason I came to this conclusion is that z = (mean - mu)/sigma
 
  • #4
z = (data point - mu)/sigma for any data point, and for any linear combination of data points. Since the "mean" (sample average) is a linear combination of the data points, the same formula applies to the sample average.

In general, if the variance of a random variable X (read: the data points) is V, then the variance of bX + a is b^2 V, where a and b are constants. Make the substitutions a = -mu/sigma and b = 1/sigma: the variance of z is V/sigma^2 = sigma^2/sigma^2 = 1. Since the sd is the square root of the variance, it follows that the sd of z has to be 1.
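Here is a minimal numerical sketch of that algebra (using NumPy; the data are simulated and not from the thread): dividing the centered values by the sample sd forces the standardized values to have sd 1, exactly as Var(bX + a) = b^2 Var(X) predicts.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=50.0, scale=12.0, size=10_000)   # arbitrary mean and sd

mu = x.mean()
sigma = x.std(ddof=1)            # sample standard deviation

# z = (x - mu) / sigma is the linear transform bX + a with b = 1/sigma, a = -mu/sigma
z = (x - mu) / sigma

print(round(z.mean(), 12))       # ~0: the mean has been shifted to zero
print(round(z.std(ddof=1), 12))  # 1.0: Var(bX + a) = b^2 * Var(X) = sigma^2 / sigma^2
```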
 
  • #5


Standardizing data to mean 0 and standard deviation (SD) 1 in multivariate analysis is a common practice with several benefits. First, it makes it easier to compare variables with different units and scales, because every variable ends up on the same unitless scale, which makes their relationships easier to interpret. For example, if one variable is measured in meters and another in kilograms, standardizing both to mean 0 and SD 1 puts them on a directly comparable footing (see the sketch below).
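A minimal sketch of that first point, assuming two hypothetical variables measured in meters and kilograms (the names and numbers are illustrative, not from the thread):

```python
import numpy as np

rng = np.random.default_rng(1)
# hypothetical measurements in incompatible units
height_m  = rng.normal(1.75, 0.10, size=500)    # meters
weight_kg = rng.normal(70.0, 15.0, size=500)    # kilograms

def standardize(v):
    """Center to mean 0 and rescale to sd 1."""
    return (v - v.mean()) / v.std(ddof=1)

height_z = standardize(height_m)
weight_z = standardize(weight_kg)

# Both columns are now unitless and on the same scale, so scale-sensitive
# methods (distances, PCA, clustering) treat them comparably.
print(height_z.std(ddof=1), weight_z.std(ddof=1))   # both 1.0
```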

Second, because standardization is only a linear rescaling, it does not remove outliers, but it does prevent a variable with a large numerical scale from dominating scale-sensitive calculations such as distances and variances. Putting every variable on the same scale makes the analysis more robust to arbitrary choices of measurement units.

Additionally, setting the SD to 1 is the convention in multivariate analysis, and it gives the standardized values an intuitive meaning: a score of 1 is exactly one standard deviation from the mean. For approximately normal data, about 68% of observations then fall between -1 and +1, a commonly used benchmark in statistical analysis.

Furthermore, dividing by 3 × SD, as suggested in the question, would not preserve any extra information; it would only shrink every standardized value by a factor of 3, so that a score of 1 corresponded to three standard deviations. That makes the scores harder to interpret and to compare across variables and studies, without changing the shape of the data at all.

In summary, standardizing data to mean 0 and SD 1 in multivariate analysis is a common and useful practice: it makes variables directly comparable, keeps large-scale variables from dominating, and follows an established convention. The original units are lost in the process, but the benefits of standardization usually outweigh this drawback.
 

1. Why is standardizing data to mean 0 and SD 1 important in multivariate analysis?

Standardizing data to mean 0 and SD 1 is important in multivariate analysis because it allows for easier interpretation and comparison of variables. It also prevents differences in measurement units and scales between variables from distorting the analysis.

2. How does standardizing data to mean 0 and SD 1 affect the results of multivariate analysis?

Standardizing data to mean 0 and SD 1 does not change the relationships between variables (correlations, for example, are unchanged); it changes the location and scale of the data, not the shape of its distribution. This makes variables easier to interpret and compare, and can improve the behavior of scale-sensitive statistical models (see the sketch below).
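A minimal sketch of that invariance claim, using simulated data (the variables and coefficients are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(10.0, 3.0, size=1_000)
y = 2.5 * x + rng.normal(0.0, 4.0, size=1_000)   # y depends on x, plus noise

zx = (x - x.mean()) / x.std(ddof=1)
zy = (y - y.mean()) / y.std(ddof=1)

# The correlation coefficient is invariant under linear rescaling,
# so the relationship between the variables is unchanged.
print(np.corrcoef(x, y)[0, 1])
print(np.corrcoef(zx, zy)[0, 1])   # identical up to floating-point error
```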

3. Can I standardize data to mean 0 and SD 1 in any type of multivariate analysis?

Yes, standardizing data to mean 0 and SD 1 is commonly used in various types of multivariate analysis, such as principal component analysis, factor analysis, and cluster analysis. It is generally recommended when the variables are measured on different scales, since these methods are sensitive to scale (see the sketch below).
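A minimal sketch of why scale matters for PCA, assuming scikit-learn is available (the data are simulated; this is an illustration, not part of the original answer):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
# hypothetical data: the second column has a much larger numerical scale
X = np.column_stack([
    rng.normal(0.0, 1.0,   size=300),
    rng.normal(0.0, 100.0, size=300),
])

# Without standardization, the large-scale column dominates the first component.
print(PCA(n_components=2).fit(X).explained_variance_ratio_)

# After standardization, both variables contribute on an equal footing.
Xz = StandardScaler().fit_transform(X)
print(PCA(n_components=2).fit(Xz).explained_variance_ratio_)
```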

4. Is standardizing data to mean 0 and SD 1 always necessary in multivariate analysis?

No, standardizing data to mean 0 and SD 1 is not always necessary in multivariate analysis. It depends on the specific goals and methods of the analysis. For example, if the goal is to compare the means of different groups, standardization may not be necessary.

5. What are the potential drawbacks of standardizing data to mean 0 and SD 1 in multivariate analysis?

One potential drawback of standardizing data to mean 0 and SD 1 is that the data are no longer expressed in their original units, which can make results harder to interpret and harder to compare with studies that did not standardize. Additionally, because standardization is only a linear rescaling, it does not by itself reduce the influence of extreme outliers on the analysis.
