Normalizing standard deviation between two data sets


Discussion Overview

The discussion concerns how to compare two data sets collected under different conditions (here, different temperatures) when their standard deviations differ: how to isolate the effect of the condition change from differences in spread. The scope includes standardization, hypothesis testing, and considerations of data distribution characteristics.

Discussion Character

  • Technical explanation
  • Debate/contested
  • Mathematical reasoning

Main Points Raised

  • One participant inquires about methods to "cancel out" standard deviations to focus on differences in data due to temperature changes.
  • Another participant suggests converting both data sets to a standard normal distribution using z-scores, indicating that this could help in comparing the data sets.
  • A different participant notes that changes in temperature might affect both the mean and standard deviation, recommending a t-test to assess mean differences, with a variant for unequal variances if necessary.
  • One participant emphasizes the importance of graphing the data to identify outliers and skewness, which could affect the validity of classical statistical tests, and suggests using non-parametric tests if necessary.
  • There is a discussion about whether to adjust significance levels when conducting multiple tests, with one participant expressing skepticism about statistical methods and preferring to collect more data instead.
  • Another participant asserts that no data is truly normal and advises against normal-based inferences if severe non-normality is indicated, suggesting that comparing t-tests and non-parametric tests does not require adjusting significance levels.

Areas of Agreement / Disagreement

Participants express differing views on the best approach to analyze the data, with some advocating for traditional t-tests and others suggesting non-parametric methods. There is no consensus on whether to adjust significance levels when performing multiple tests, and the discussion remains unresolved regarding the best practices for handling non-normal data.

Contextual Notes

Participants highlight limitations related to assumptions of normality, the robustness of statistical tests in the presence of outliers, and the potential need for different analytical approaches based on data characteristics.

TheAnalogKid83

I have a baseline set of data with a standard deviation of X, and I then collected another set of the data under a different condition (a different temperature), which has a different standard deviation Y. How do I cancel out the standard deviations, so that I only see the difference in the actual data as it varies with the condition change (temperature)? I'm keeping this vague on purpose, but if it isn't clear what I'm looking for, I'll give my application example. Even just a general topic to point me to would help.
 


Seems to me that you could convert both to a "standard" normal distribution by taking x' = (x − μ_x)/σ_x and y' = (y − μ_y)/σ_y. If you don't want to worry about the means, just dividing each by its standard deviation should give you a distribution with standard deviation 1.
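The standardization above can be sketched in Python. This is not from the thread; the random samples and variable names are purely illustrative stand-ins for the two data sets:

```python
import numpy as np

# Hypothetical samples standing in for the baseline and the
# changed-temperature data sets (values are illustrative only).
rng = np.random.default_rng(0)
baseline = rng.normal(loc=10.0, scale=2.0, size=50)
heated = rng.normal(loc=12.0, scale=3.5, size=50)

def standardize(x):
    """Convert a sample to z-scores: subtract the sample mean and
    divide by the sample standard deviation, giving mean 0 and
    standard deviation 1."""
    return (x - x.mean()) / x.std(ddof=1)

z_base = standardize(baseline)
z_heat = standardize(heated)

# Both standardized samples now have standard deviation 1 (and mean 0),
# so any remaining difference in shape is not a scale effect.
print(z_base.std(ddof=1), z_heat.std(ddof=1))
```

Note that after full standardization the means are both 0 as well, so this view deliberately discards the very mean shift the original poster may care about; dividing by the standard deviation alone (without centering) keeps the means comparable.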
 


The change in temperature may have affected both the mean and the standard deviation of the "true" probability distribution. If by eye the two sample standard deviations look the same, just use an ordinary two-sample t-test to see if the means are different. If the standard deviations look very different, or if you suspect on theoretical grounds that they are different, then use a t-test variant (Welch's test) in which the standard deviations are not assumed equal.

See for example:
4.3.3 Unequal sample sizes, unequal variance
http://en.wikipedia.org/wiki/Student's_t-test
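The two variants mentioned above can be sketched with SciPy (this example is not from the thread; the data are illustrative). `scipy.stats.ttest_ind` runs the pooled-variance test by default, and `equal_var=False` switches to Welch's unequal-variance variant:

```python
import numpy as np
from scipy import stats

# Illustrative samples: the second condition has a visibly larger spread.
rng = np.random.default_rng(1)
baseline = rng.normal(10.0, 2.0, size=40)
heated = rng.normal(11.5, 4.0, size=40)

# Standard two-sample t-test (assumes equal variances)...
t_eq, p_eq = stats.ttest_ind(baseline, heated)
# ...versus Welch's variant, which does not assume equal variances.
t_w, p_w = stats.ttest_ind(baseline, heated, equal_var=False)

print(f"pooled t-test:  t={t_eq:.3f}, p={p_eq:.4f}")
print(f"Welch's t-test: t={t_w:.3f}, p={p_w:.4f}")
```

When the sample sizes are equal the two statistics coincide; it is with unequal sizes and unequal spreads that the pooled test can be badly miscalibrated, which is why Welch's version is the safer default.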
 


One more comment on this. Graph your data first - whether a simple dotplot, stemplot, boxplot, or histogram if the samples are large. Look for evidence of outliers and/or skewness. Both of these can cause problems with the classical procedures, as they are not robust in the face of departures from normality. If you see skewness (or even several outliers with overall symmetry) you should also do a non-parametric test (Wilcoxon or equivalent). (I would suggest always doing this, but my training is in non-parametrics.) Intuitively, if the two results are in agreement, the t-test results may be good enough. If the two results are in great disagreement, you should suspect the t-test results.
(DO NOT be tempted to throw away outliers in order to obtain a specific result: unless the outliers are due to recording error, that is not valid)
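A minimal sketch of this advice in Python (not from the thread; the skewed lognormal samples are illustrative). Sample skewness serves here as a numeric stand-in for eyeballing a plot, and the two-sample Wilcoxon rank-sum test is run as `scipy.stats.mannwhitneyu`:

```python
import numpy as np
from scipy import stats

# Illustrative right-skewed samples (lognormal) - the kind of data
# where classical t procedures can mislead.
rng = np.random.default_rng(2)
baseline = rng.lognormal(mean=1.0, sigma=0.6, size=40)
heated = rng.lognormal(mean=1.3, sigma=0.6, size=40)

# Numeric stand-in for inspecting a plot: sample skewness well away
# from 0 is a warning sign for normal-based procedures.
print("skewness:", stats.skew(baseline), stats.skew(heated))

# Two-sample Wilcoxon rank-sum test (Mann-Whitney U), which does
# not assume normality.
u_stat, p_val = stats.mannwhitneyu(baseline, heated,
                                   alternative="two-sided")
print(f"Mann-Whitney: U={u_stat:.1f}, p={p_val:.4f}")
```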
 


statdad said:
One more comment on this. Graph your data first - whether a simple dotplot, stemplot, boxplot, or histogram if the samples are large. [...] If the two results are in great disagreement, you should suspect the t-test results.

Yes, I agree with that. One thing I've never understood properly: if you do two independent tests, say a t-test and a Wilcoxon, should you change the p value you accept as "significant" (i.e. analogous to Bonferroni and its ilk)? In which case, maybe do only the Wilcoxon if non-Gaussianity is suspected? I usually just distrust statistics and collect more data, unless I need the paper published immediately. :rolleyes:
 


No data is truly normal (Gaussian), although the 'middle' can quite closely resemble normally distributed data. If your initial graphs indicate severe non-normality, it's usually best to avoid normal-based inferences altogether.
If you do both a t-test and a non-parametric test for comparison, as a simple check, there isn't any real need to adjust significance levels at all.
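The "run both as a check, no adjustment" approach can be sketched as follows (not from the thread; data and the 0.05 level are illustrative). Both tests are applied to the same samples and their conclusions compared at the same unadjusted significance level:

```python
import numpy as np
from scipy import stats

# Illustrative samples with a modest mean shift.
rng = np.random.default_rng(3)
baseline = rng.normal(10.0, 2.0, size=40)
heated = rng.normal(11.0, 2.0, size=40)

# Run both tests on the same data.
_, p_t = stats.ttest_ind(baseline, heated, equal_var=False)
_, p_w = stats.mannwhitneyu(baseline, heated, alternative="two-sided")

alpha = 0.05  # unadjusted, per the advice in the post
print(f"t-test p={p_t:.4f}, Wilcoxon p={p_w:.4f}")
print("agree" if (p_t < alpha) == (p_w < alpha) else "disagree")
```

If the two tests agree, the normal-based result is probably trustworthy; if they disagree, that disagreement itself is the warning, which is why no Bonferroni-style correction is being applied here.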
 
