How to tell if data is normally distributed?

In summary, there are formal tests of normality, such as the Kolmogorov-Smirnov and Shapiro-Wilk tests, and q-q plots provide a useful visual check. However, histograms alone are unreliable, and with a large sample size a significance test will reject normality for even trivial departures. Because normality is assumed more often than it should be, using robust methods is recommended.
  • #1
jimmy1
Is there a formal way of telling if my data is normally distributed?
I know I could plot a histogram for the data, and see if it follows a bell shaped curve, but I need something a lot more formal than this.
Is there a way to do it?
Thanks
 
  • #4
I know one characteristic the Normal Distribution must have is the same Mean, Mode and Median, and it can only be unimodal. I'd simply test all of these factors and see if the numbers are the same. Though, I'm not sure if they have to be exact to the tenth. For example, I think if the Mode=71, Mean=70.6, and Median=71.2, and the only mode was 71, then it would be considered normally distributed.

I know you've probably already figured this out, but I'm adding my comment in case someone else has the same problem. Or maybe I'm completely wrong on this and someone can help me.
 
  • #5
jimmy1 said:
Is there a formal way of telling if my data is normally distributed?
I know I could plot a histogram for the data, and see if it follows a bell shaped curve, but I need something a lot more formal than this.
Is there a way to do it?
Thanks

For normally distributed data, the skewness should be zero and the kurtosis should equal 3.

Hope this helps.
 
  • #6
The comments about mean = median = mode, skewness = 0, and kurtosis = 3 are very unlikely to hold exactly for real data. The normal distribution is an idealized model that describes general characteristics very well, but it is rarely (I would argue never) exactly correct.
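To illustrate how far these sample statistics drift from the textbook values even when the data really are normal, here is a minimal Python sketch. It assumes NumPy and SciPy are installed; the seed, sample size, mean, and standard deviation are arbitrary choices made for illustration, not values from this thread.

```python
import numpy as np
from scipy import stats

# Simulated data: 10,000 draws from a true normal distribution.
# The mean (70), standard deviation (5), and seed are arbitrary.
rng = np.random.default_rng(0)
sample = rng.normal(loc=70.0, scale=5.0, size=10_000)

mean = sample.mean()
median = np.median(sample)
skewness = stats.skew(sample)
# fisher=False returns plain kurtosis, which is 3 for a normal distribution.
# The default (fisher=True) reports excess kurtosis, i.e. kurtosis minus 3.
kurt = stats.kurtosis(sample, fisher=False)

print(f"mean={mean:.3f}  median={median:.3f}")
print(f"skewness={skewness:.3f}  kurtosis={kurt:.3f}")
```

The printed values should land close to, but not exactly on, the ideal ones, which is why the comparison has to be made with a test rather than by checking for equality.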

The tests typically allow you to conclude only that your data "isn't significantly different" from what you would expect under the normal model. Histograms are decidedly poor as an aid, since too much depends on the choice of bin width (and so the number of bins) and on the sample size.

You might look at the Kolmogorov-Smirnov test (http://mathworld.wolfram.com/Kolmogorov-SmirnovTest.html), which compares your sample's empirical distribution to a normal distribution, although it works best when you don't estimate the mean and standard deviation from the sample values.
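As a rough sketch of the Kolmogorov-Smirnov test in practice, the Python snippet below uses `scipy.stats.kstest` (assuming SciPy is available). The reference mean and standard deviation (10 and 2) are hypothetical values treated as known in advance, and both samples are simulated purely for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical setup: the reference mean and standard deviation are
# specified in advance (10 and 2), not estimated from the sample.
normal_sample = rng.normal(loc=10.0, scale=2.0, size=500)
skewed_sample = rng.exponential(scale=10.0, size=500)

# One-sample K-S test against the N(10, 2) cumulative distribution.
ks_normal = stats.kstest(normal_sample, "norm", args=(10.0, 2.0))
ks_skewed = stats.kstest(skewed_sample, "norm", args=(10.0, 2.0))

print(f"normal sample:      D={ks_normal.statistic:.3f}  p={ks_normal.pvalue:.4g}")
print(f"exponential sample: D={ks_skewed.statistic:.3f}  p={ks_skewed.pvalue:.4g}")
```

If the mean and standard deviation are instead estimated from the same sample, the standard p-value comes out too large; the Lilliefors correction is the usual remedy for that case.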
q-q plots (quantile-quantile plots) are a useful visual tool.
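One way to get the numbers behind a q-q plot without drawing anything is `scipy.stats.probplot`, sketched below under the assumption that SciPy is installed. The two simulated samples and the variable names are illustrative only. The correlation `r` of the fitted line is a crude summary; the plot itself, especially its two ends, is what you would normally inspect.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
normal_sample = rng.normal(size=500)
skewed_sample = rng.exponential(size=500)

# probplot returns the theoretical quantiles (osm), the ordered sample
# values (osr), and a least-squares line fitted through those points.
(osm_n, osr_n), (slope_n, intercept_n, r_normal) = stats.probplot(
    normal_sample, dist="norm"
)
(osm_s, osr_s), (slope_s, intercept_s, r_skewed) = stats.probplot(
    skewed_sample, dist="norm"
)

print(f"normal sample:      r = {r_normal:.4f}")
print(f"exponential sample: r = {r_skewed:.4f}")
```

Plotting `osr` against `osm` gives the usual picture: points from normal data hug a straight line, while skewed or heavy-tailed data bend away from it at the ends.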

What often occurs is that your data set resembles a normal distribution "in the middle", but problems occur in the extremes (tails). Sadly, that's often the region in which you have the most interest.

Good luck with your investigations.
 
  • #7
A problem with the Shapiro-Wilk test and some other tests is that they set the normal distribution as the null hypothesis and then see whether the data give a p-value low enough to reject it. This is an issue because, if you have a lot of data points, it is easy to reject the null of normality. It is a bigger issue with significance testing in general: with a really large sample size you'll find all sorts of relationships in the data. This is one reason why people often just inspect the data visually.
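The sample-size effect can be sketched in Python as follows. Note the substitution: this uses `scipy.stats.normaltest` (the D'Agostino-Pearson test) rather than Shapiro-Wilk, because SciPy warns that its Shapiro-Wilk p-values may be inaccurate above 5000 points. The Student's t distribution with 10 degrees of freedom is an arbitrary stand-in for data that are "almost normal".

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Student's t with 10 degrees of freedom: bell-shaped, with tails only
# slightly heavier than a normal distribution.
p_values = {}
for n in (50, 500, 200_000):
    sample = rng.standard_t(df=10, size=n)
    p_values[n] = stats.normaltest(sample).pvalue
    print(f"n={n:>7}: p = {p_values[n]:.4g}")
```

Smaller samples from this distribution often pass, whereas the largest one is rejected decisively, even though all three come from exactly the same distribution.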
 
  • #8
wvguy8258 said:
A problem with the Shapiro-Wilk test and some other tests is that they set the normal distribution as the null hypothesis and then see whether the data give a p-value low enough to reject it. This is an issue because, if you have a lot of data points, it is easy to reject the null of normality. It is a bigger issue with significance testing in general: with a really large sample size you'll find all sorts of relationships in the data. This is one reason why people often just inspect the data visually.

The comment about the downsides of the Shapiro-Wilk test, and of tests in general, is valid. But while "This is one reason why people often just inspect the data visually" may be true, it's an incredibly bad thing to do. Again, most data is "normal in the middle" with problems in the tails. With histograms being both unreliable and so commonly used, the "assumption" of normality is made more often than it should be.

"This is one reason why people should use robust methods" would be a better comment.
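As one possible reading of "robust methods", the Python sketch below compares the mean and standard deviation with the median and the median absolute deviation (MAD) on data that are normal in the middle but have a contaminated upper tail. It assumes NumPy and SciPy; the contamination scheme (10 points at 50) is made up for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# 990 well-behaved points plus 10 gross outliers in the upper tail.
clean = rng.normal(loc=0.0, scale=1.0, size=990)
outliers = np.full(10, 50.0)
data = np.concatenate([clean, outliers])

mean = data.mean()
std = data.std(ddof=1)
median = np.median(data)
# scale="normal" rescales the MAD so that it estimates the standard
# deviation when the bulk of the data is normal.
mad = stats.median_abs_deviation(data, scale="normal")

print(f"mean={mean:.3f}    std={std:.3f}")
print(f"median={median:.3f}  MAD={mad:.3f}")
```

The median and MAD stay near the values for the clean bulk of the data (0 and 1), while the mean and especially the standard deviation are pulled away by the 1% of contaminated points.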
 

What is the definition of normal distribution?

The normal distribution is a continuous probability distribution, often described as a "bell curve" due to its shape. It is symmetric about its mean, is fully specified by its mean and standard deviation, and is widely used in statistics and scientific research.

How can I visualize if my data is normally distributed?

One way to visualize the distribution is by creating a histogram of the data. A histogram is a graph that shows the frequency of values in a dataset. If the histogram resembles a bell-shaped curve, the data may be approximately normal, but the impression depends heavily on the bin width and the sample size, so a q-q plot is a more reliable visual check.

What statistical tests can I use to determine if my data is normally distributed?

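There are several statistical tests that can be used to assess whether data are normally distributed, such as the Shapiro-Wilk test, the Kolmogorov-Smirnov test, and the Anderson-Darling test. These tests compare the data to a normal distribution. The p-value they report is the probability of seeing a departure from normality at least as large as the one observed if the data really were drawn from a normal distribution; it is not the probability that the data are normal. A p-value below 0.05 is conventionally taken as evidence against normality, while a larger p-value only means that normality was not rejected.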
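A minimal Python sketch of running one of these tests, the Shapiro-Wilk test, with `scipy.stats.shapiro` is shown below. It assumes NumPy and SciPy are installed; the two simulated samples and the 0.05 threshold are illustrative choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
samples = {
    "normal": rng.normal(size=200),
    "exponential": rng.exponential(size=200),
}

alpha = 0.05  # conventional threshold, chosen before looking at the data
p_values = {}
for name, data in samples.items():
    statistic, p_value = stats.shapiro(data)
    p_values[name] = p_value
    verdict = "reject normality" if p_value < alpha else "normality not rejected"
    print(f"{name}: W={statistic:.3f}  p={p_value:.4g}  -> {verdict}")
```

A result of "normality not rejected" means only that this sample gave no strong evidence against the normal model, not that the data are normal.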

Is it important for my data to be normally distributed?

It depends on the analysis you are conducting. Some statistical procedures, such as t-tests and ANOVA, assume that the errors (the data within each group, or the residuals) are approximately normally distributed. If that assumption is badly violated, you may need to use alternative tests or transform the data to better meet it.

What should I do if my data is not normally distributed?

If your data is not normally distributed, you may need to use non-parametric tests, which do not assume normality. Alternatively, you can try transforming your data using methods such as log or square root transformations to make it more normally distributed. It is important to consult with a statistician or conduct further research to determine the best approach for your specific data and analysis.
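To show what a log transformation does to right-skewed data, here is a small Python sketch assuming NumPy and SciPy. The lognormal sample is simulated, so the transformation works perfectly by construction; real data will rarely behave this neatly.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)

# Simulated right-skewed data: lognormal, so log(raw) is exactly normal.
raw = rng.lognormal(mean=0.0, sigma=1.0, size=1_000)
# The log is only defined because every value is strictly positive.
logged = np.log(raw)

skew_raw = stats.skew(raw)
skew_logged = stats.skew(logged)

print(f"skewness before log transform: {skew_raw:.3f}")
print(f"skewness after log transform:  {skew_logged:.3f}")
```

The skewness should drop from a clearly positive value to something near zero. Keep in mind that any conclusions drawn afterwards apply to the transformed scale, not the original one.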
