Homework Help: Normal Distribution - self study / review

1. Jul 8, 2014

eehiram

1. The problem statement, all variables and given/known data

My source textbook is a community college / junior college (cf. lower-division undergraduate)
probability and statistics textbook:

An Introduction to Mathematical Statistics and Its Applications, 2nd Edition
Authors: Richard J. Larsen, Morris L. Marx
1986 Prentice-Hall

[See relevant equations for Theorem 4.3.1 (DeMoivre-Laplace) equation on Normal Distribution below...]

I would like to review the easiest analyses of deviations from the Normal Distribution.
Exempli gratia: mean, variance, efficiency, consistency, sufficiency?

2. Relevant equations

$\lim_{n\rightarrow \infty} {P \Big(c < {\frac{X - np} {\sqrt{npq}}} < d \Big)} = \frac{1} {\sqrt{2\pi}} \int_c^d e^{\frac{-x^2} {2}} \,dx$
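
As a quick numerical check of the limit above, here is a minimal sketch that compares the exact binomial probability with the normal integral for a finite n. The function names and the values n = 1000, p = 0.3, c = -1, d = 1 are illustrative assumptions and do not come from the textbook.

```python
import math

def binomial_pmf(k, n, p):
    """Exact P(X = k) for X ~ Binomial(n, p), computed via logs to avoid overflow."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log(1 - p))
    return math.exp(log_pmf)

def standardized_binomial_prob(n, p, c, d):
    """P(c < (X - np)/sqrt(npq) < d), summed over the integer values of X."""
    mean = n * p
    sd = math.sqrt(n * p * (1 - p))
    return sum(binomial_pmf(k, n, p)
               for k in range(n + 1)
               if c < (k - mean) / sd < d)

def normal_prob(c, d):
    """(1/sqrt(2*pi)) * integral from c to d of exp(-x^2/2) dx, via the error function."""
    def phi(z):
        return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return phi(d) - phi(c)

# Illustrative values only (assumptions, not textbook data).
exact = standardized_binomial_prob(1000, 0.3, -1.0, 1.0)
limit = normal_prob(-1.0, 1.0)
print(f"binomial: {exact:.4f}   normal limit: {limit:.4f}")
```

For finite n the two numbers agree only approximately, because X takes integer values while the integral is continuous.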

3. The attempt at a solution

1. What are the easiest analyses of deviations from Normal Distribution?

Exempli gratia: mean, variance, efficiency, consistency, sufficiency?

a) (Section 4.3; Theorem 4.3.3)
(Mean is represented by μ)
Expected value E(x) can be calculated as:
E(X) = μ

b) (Section 4.3; Theorem 4.3.3)
Variance can be calculated as:
Var(X) = σ²

c) (Section 5.3-5.5; Definition 5.5.1)
Efficiency of estimators W1 and W2 can be compared as:
W1 is more efficient than W2 when:
Var(W1) < Var(W2)

The comparison can also be expressed as the relative efficiency of W1 with respect to W2:
Var(W2) / Var(W1)
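
To make the variance comparison concrete, here is a minimal simulation sketch. It takes W1 to be the sample mean and W2 the sample median of normal data; that choice of estimators, the sample size of 25, the 4000 trials, and the function name are all assumptions made for illustration, not examples from the textbook.

```python
import random
import statistics

def estimator_variance(estimator, n, trials, rng):
    """Empirical variance of an estimator over repeated N(0, 1) samples of size n."""
    estimates = [estimator([rng.gauss(0.0, 1.0) for _ in range(n)])
                 for _ in range(trials)]
    return statistics.pvariance(estimates)

rng = random.Random(12345)  # fixed seed so the run is repeatable

var_w1 = estimator_variance(statistics.mean, 25, 4000, rng)    # W1: sample mean
var_w2 = estimator_variance(statistics.median, 25, 4000, rng)  # W2: sample median

relative_efficiency = var_w2 / var_w1
print(f"Var(W1) = {var_w1:.4f}, Var(W2) = {var_w2:.4f}, "
      f"Var(W2)/Var(W1) = {relative_efficiency:.2f}")
```

A ratio above 1 says the mean is the more efficient of the two for normal data; for large normal samples the ratio is known to approach π/2, about 1.57.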

d) (Section 5.7; Definition 5.7.1)
An estimator Wn is consistent if the following sufficient conditions hold:

i. Wn is asymptotically unbiased;

ii. Var(Wn) converges to 0.
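
Conditions i and ii together are enough for Wn to converge in probability to the parameter it estimates. A short sketch of why, via Chebyshev's inequality, where $\theta$ (the target parameter) and $\varepsilon > 0$ (any tolerance) are symbols introduced here for illustration:

$$
P\big(|W_n - \theta| \ge \varepsilon\big)
\le \frac{E\big[(W_n - \theta)^2\big]}{\varepsilon^2}
= \frac{\mathrm{Var}(W_n) + \big(E(W_n) - \theta\big)^2}{\varepsilon^2}
\longrightarrow 0 \quad \text{as } n \to \infty
$$

The variance term vanishes by condition ii and the squared-bias term vanishes by condition i.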

e) (Section 5.7; Definition 5.7.2; Theorem 5.7.1: Fisher-Neyman Criterion)
Determination of Sufficiency can be facilitated by way of: Fisher-Neyman Criterion.

[I can supply Definition 5.7.2; Theorem 5.7.1: Fisher-Neyman Criterion if requested... Or an explanation of Sufficiency.]

[Any other beginner's introduction to Estimation, Unbiasedness, and deviations from Normal Distribution will be welcomed.]

2. What is the frequency of departures from Normal Distribution when considering near-to-Normal Distribution data?
(BTW the data need not be real data, such as my textbook's examples of astronomical data.)

3. (Completely optional bonus round!)
How can the Gaussian Function be resolved, as the probability density function of the Normal Distribution?

[Thanks to any who reply, to any part of this post!]

2. Jul 9, 2014

Simon Bridge

Did you have a question?

3. Jul 9, 2014

eehiram

These were intended as my questions:

I'm sorry that my questions were not clearly stated initially:

1. What are the easiest analyses of deviations from Normal Distribution?
(Examples are given only as a starting point.)

2. What is the frequency of departures from Normal Distribution when considering near-to-Normal Distribution data?
(BTW the data need not be real data, such as my textbook's examples of astronomical data.)

3. (Completely optional bonus round!)
How can the Gaussian Function be resolved, as the probability density function of the Normal Distribution?

4. Jul 9, 2014

Simon Bridge

1. Interpreting "deviation from Normal Distribution" to mean how poor a fit the normal distribution is to the data:
We would more commonly look for a goodness of fit instead of a poorness of fit.
This is because there is no data that the Normal Distribution will be a perfect fit for.

2. The Normal distribution is a theoretical limit - there exists no data for which it is an exact fit.
Hence: "goodness of fit" methods.

3. The Normal Distribution function is a special case of the Gaussian function.
In physics, for historical reasons, it is common to say "Gaussian" instead of "Normal" to avoid the emotional baggage some people bring with them when they hear the word "normal".

You can turn one into the other by comparing their respective equations.
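
A sketch of that comparison, using the common three-parameter form of the Gaussian function (the symbols $a$, $b$, $c$ are the usual convention and are assumed here, not taken from the textbook):

$$
f(x) = a \exp\!\left(-\frac{(x-b)^2}{2c^2}\right),
\qquad
\int_{-\infty}^{\infty} f(x)\,dx = a\,c\,\sqrt{2\pi}
$$

Choosing $b = \mu$, $c = \sigma$ and $a = \frac{1}{\sigma\sqrt{2\pi}}$ makes the total area equal to 1, which gives the Normal probability density function:

$$
f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
$$

With $\mu = 0$ and $\sigma = 1$ this reduces to the integrand $\frac{1}{\sqrt{2\pi}} e^{-x^2/2}$ in the DeMoivre-Laplace limit quoted in post #1.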

5. Jul 11, 2014

eehiram

Simon Bridge:
Thank you for the response.

Perhaps some examples of reality-based data, though I don't need them to be real, might be illustrative of what I meant by "deviation from" the Gaussian / Normal Distribution, or Estimation of the Gaussian / Normal Distribution:

a) astronomical data

(Please do not think I need reference to real orbits of planets, comets, asteroid belts, et alii; fictional data will suffice for this illustration.)

b) human population data

i. country populations (total size);
GDP per capita;
population density;
taxes paid per capita (depending on progressive income tax brackets and other factors);
et alii.

ii. legal / accounting data

iii. school / student data

c) games of chance

i. coin tosses data, rolls of die data, card game data, et alii;

d) weather / meteorological data

Simeon Denis Poisson (b.1781; d.1840) applied probability to the law;
in 1837, he wrote Recherches sur la probabilité des jugements, which included a limit theorem for the binomial distribution.
This theorem was the seed for the Poisson Distribution.

According to:
http://mathworld.wolfram.com/NormalDistribution.html

Thanks for responding!

Last edited: Jul 11, 2014

6. Dec 26, 2014

eehiram

Simon Bridge:

First, I want to thank you again for responding to my thread on 'Normal Distribution - self study / review' on July 9, 2014. I originally hoped that this thread would serve as an opportunity for me to study and review my textbook, which I cited in my post #1.

I'm sorry for the delay in my response. One reason is that I was struck by the implied sentiment in your post: you seemed to be angry with me, and I was reluctant to post again. I decided to take time off and try to figure out how to respond, if at all. I value my membership at Physics Forums and do not want to get suspended or otherwise disciplined. So I was leery of posting carelessly again.

Thus, I would like to attempt to clarify the reasoning for my thread, if it's alright. (I hope that I won't seem too annoying by posting again; yet by now I feel that I would prefer to attempt to resume this thread on Normal Distribution that I typed in so carefully.) I might as well explain myself rather than hold back, since the thread has remained inactive for more than 6 months. I want to clarify my reasoning in the hope that it will make clear what kind of discussion I had hoped to get to eventually with this thread.

But before I clarify the reasoning for my thread: I would like to reassure you that, although it may not seem like it, I did indeed read your post carefully on "goodness of fit" as your redirection from "poorness of fit". I understood your redirection, both in the literal meaning, and as a gesture of disapproval toward me.

And I recognize that I did not seem to acknowledge your redirection in my subsequent post. My failure to redirect may have appeared to show that I did not read your post carefully. In fact I did read it, and was unsure about how to proceed with my response, but since my response (post #3 of mine) turned out so long, I initially thought I had better wait for another response from you before acknowledging your redirection. Now that no response has come after 6 months, I am writing this post #4 of mine with the desire to avoid another miscommunication.

I hope that my clarification on the reasoning for this thread will help to alleviate any of your frustration for my not immediately switching in my 3rd post, on Jul 11, 2014 to a query on "goodness of fit", as per your redirection. I hope that this 4th post will make clear what kind of discussion I had wanted to get to regarding "poorness of fit" in the first place, and perhaps address any other doubts that you may have concerning my desire for this thread.

I appreciate your noting in your response that essentially 0% of actual statistical data is an exact match for Normal Distribution. I felt that you were noting something usually misused or misunderstood; but it also seemed that you were annoyed with me. I will explain below why I was interested in actual statistical data that is not an exact match, but sufficiently in agreement, with Normal Distribution. Enough agreement to justify the use of the Normal Distribution is sufficient to then lead to an estimate of the percentage.

I did understand your reference to "Gaussian" Distribution as an alternative, less emotionally-charged name for a Normal Distribution as well. In this case, I'm not sure if you would indeed prefer me to edit all my subsequent references to Normal Distribution as Gaussian Distribution. I am willing to do so, but will wait for you to state such a recommendation. In this post #4, I will continue to use my probability and statistics textbook's term: Normal Distribution. (I had intended my question about Gaussian Distribution to be an optional, extra question, which you have answered in your response.)

In the following paragraphs, I will attempt to go ahead and explain why I am interested in the comparison between Normal Distribution and actual statistical data:

I had hoped for an eventual discussion as to whether or not the Normal Distribution is the majority or minority when compared to actual statistical data. My examples of statistical data were listed in post #3, and were not limited to the following:

a) astronomical data
b) human population data
c) games of chance
d) weather / meteorological data

These examples are supposed to give an idea of what I am referring to with "actual statistical data".

(I am not asking for an exact percentage of actual statistical data that matches the Normal Distribution. I do not expect you to look up such a percentage. Rather, I would merely like to see a loose estimate of that percentage, perhaps as a multiple of 10%, so as to clarify the question about majority / minority of actual statistical data.)

If the Normal Distribution does indeed model the majority of actual statistical data, this leads to a similar question for me: is the Normal Distribution a useful assumption to employ in everyday calculations and estimates? Is it an accurate model for offhand and quick calculations? Or is it too misused and misunderstood to be a good choice? (I worry that it may be too misused and misunderstood, such as in informal contexts and verbal discussions. The proper formulation, taken straight from a textbook, may be necessary to use Normal Distribution correctly and to benefit those who attempt to employ it.)

If the Normal Distribution represents the majority of actual statistical data, then when is the Normal Distribution not a good model for actual statistical data? For example, some statistical data have peculiar tendencies, such as over-representing certain individual values, or have gaps in which the frequency is 0. Can someone here name examples of statistical data that tend away from the Normal Distribution? Such examples might help me to gain a more general grasp of when the Normal Distribution should be avoided.
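
As a rough illustration of what such a departure can look like numerically, here is a minimal sketch that measures the largest gap between a sample's empirical distribution and a Normal Distribution fitted to it. The Kolmogorov-Smirnov-style distance, the function name, and both simulated data sets are assumptions chosen for illustration; none of this is taken from the textbook or from real data.

```python
import random
import statistics

def ks_distance_to_fitted_normal(data):
    """Largest gap between the empirical CDF and a normal CDF fitted by mean and sd."""
    xs = sorted(data)
    n = len(xs)
    fitted = statistics.NormalDist(statistics.mean(xs), statistics.stdev(xs))
    gap = 0.0
    for i, x in enumerate(xs):
        cdf = fitted.cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x, so check both sides.
        gap = max(gap, abs(cdf - i / n), abs((i + 1) / n - cdf))
    return gap

rng = random.Random(2014)  # fixed seed so the run is repeatable

# Fictional data: one near-normal sample and one strongly skewed sample.
near_normal = [rng.gauss(50.0, 10.0) for _ in range(2000)]
skewed = [rng.expovariate(1.0) for _ in range(2000)]

d_normal = ks_distance_to_fitted_normal(near_normal)
d_skewed = ks_distance_to_fitted_normal(skewed)
print(f"near-normal sample: {d_normal:.3f}   skewed sample: {d_skewed:.3f}")
```

The distance is used here only as a descriptive measure: because the mean and standard deviation are estimated from the same data, the standard Kolmogorov-Smirnov tables would not apply directly.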

I did not write these questions in my initial posts due to my desire to avoid informality. (Instead, I wrote my request for analysis of the deviations from Normal Distribution, which was redirected to the term "goodness of fit".) I had hoped originally that these informal questions would be addressed, in general terms, by your responses, which I expected to build into an interesting discussion of the Normal Distribution in the context of probability and statistics. Thus I decided not to write my informal questions at first, but your responses so far do not contain enough to answer them. So, now that the thread has died and 6 months have passed, I have decided to reveal my informal questions in the hope that they may help you to understand my earlier posts.

I apologize for the length of this post #4.

7. Dec 26, 2014

Ray Vickson

The nature of the errors depends on the underlying distribution. If X is truly Binomial, a lot is known about the accuracy of the normal limit, and you can find discussions in any good probability book, such as Feller, Introduction to Probability Theory and its Applications, Vol. I, Wiley. Feller has a whole chapter dealing carefully with the normal limit of the binomial, and he cites numerous past work, including the original sources.

For a more general problem, you are perhaps asking about errors in taking the normal distribution when using the Central Limit Theorem, but in the case of finite (large) $n$ rather than in the true limit $n \to \infty$. Look up the Berry-Esseen Theorem; see, e.g., http://en.wikipedia.org/wiki/Berry–Esseen_theorem . It has some of the basic results and refers to some of the original literature, leaving you in a position to do a more extensive search.
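
For the Binomial case, the finite-$n$ error can be computed directly and set beside the Berry-Esseen bound $C\rho / (\sigma^3 \sqrt{n})$. The following is a minimal sketch under stated assumptions: the constant 0.4748 is an upper bound reported in the literature for independent, identically distributed summands, and the function names and the choices of $n$ and $p$ are purely illustrative.

```python
import math

BERRY_ESSEEN_C = 0.4748  # assumed constant (a published upper bound for the i.i.d. case)

def normal_cdf(z):
    """Standard normal CDF, Phi(z), via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def binomial_pmf(k, n, p):
    """Exact P(X = k) for X ~ Binomial(n, p), computed via logs to avoid overflow."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log(1 - p))
    return math.exp(log_pmf)

def max_cdf_error(n, p):
    """sup over x of |F_n(x) - Phi(x)| for the standardized Binomial(n, p)."""
    mean = n * p
    sd = math.sqrt(n * p * (1 - p))
    worst, cdf_below = 0.0, 0.0
    for k in range(n + 1):
        phi = normal_cdf((k - mean) / sd)
        cdf_at = cdf_below + binomial_pmf(k, n, p)
        # F_n is a step function, so the sup is reached at a jump: check both sides.
        worst = max(worst, abs(cdf_below - phi), abs(cdf_at - phi))
        cdf_below = cdf_at
    return worst

def berry_esseen_bound(n, p):
    """C * rho / (sigma^3 * sqrt(n)) for a sum of n Bernoulli(p) trials."""
    q = 1.0 - p
    rho = p * q * (p * p + q * q)   # E|X - p|^3 for one Bernoulli trial
    sigma_cubed = (p * q) ** 1.5
    return BERRY_ESSEEN_C * rho / (sigma_cubed * math.sqrt(n))

for n in (10, 100, 1000):
    print(n, round(max_cdf_error(n, 0.5), 4), round(berry_esseen_bound(n, 0.5), 4))
```

Both columns shrink roughly like $1/\sqrt{n}$, and the actual error stays below the bound.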

BTW: I saw NOTHING in Simon Bridge's post that was in any way an indication of anger. Asking if you have a question is just asking for clarification of what you want: nothing more, nothing less.

8. Dec 28, 2014

eehiram

Ray Vickson:

Thank you for the excellent response, with a referral to a textbook, a referral to a nice Wikipedia article, and reassurance concerning Simon Bridge! I hope that we can have a good dialogue about Normal Distribution on this thread, and that this thread will continue to force me to review my old, dusty 2 probability and statistics textbooks and 1 study guide.

(I took my lower-division math class on probability and statistics in fall 1993 and my upper division class in spring 1997. A long time has passed now since my last math class, so I am happy that I was able to do a few hours of review for this thread.)

You wrote that if the random variable X is truly binomial, a lot is known about the accuracy of the normal limit. In my review of my textbooks, I saw that the "binomial" term can refer to a random variable, a distribution, or a probability (and perhaps other terms). The Binomial Distribution itself was developed before the Normal Distribution, and led to the development of the Normal Distribution. Please be patient with me as I try to make sense of which additional term you are using whenever you write "binomial".

Thank you for citing your textbook by Feller. It sounds like a good textbook, but I probably won't buy it now, as I have enough probability and statistics textbooks.

My 2 introductory textbooks and 1 study guide explain the Central Limit Theorem as they end part I on descriptive statistics. Then they both transition to part II on inferential statistics. The Berry-Esseen Theorem is not arrived at in either textbook, nor in the study guide. The nice Wikipedia article on Berry-Esseen Theorem seems to explain that the error, $|F_n(x) - \Phi(x)|$, can be bounded according to the specific qualifications of the theorem, and that in the case of independent samples, the convergence rate is $n^{-1/2}$, where $n$ = sample size.

I noticed that you quoted my post #3 on types of data: astronomical data, human population data, games of chance data, weather / meteorological data. Can you - without needing to look up data or make any huge, undue effort - apply the Berry-Esseen Theorem to find the error in taking the Normal Distribution for this type of data or similar data? I don't expect a large calculation from you, of course.

Otherwise, what other theorems, equations, or insights can you share to relate to my list of types of data in regards to the selection of Normal Distribution? Any information will be welcome.

I'm glad to be reassured about Simon Bridge's response not containing the anger or annoyance that I perceived. I may have misinterpreted his writing tone, as can happen in written messages. Both of you are helpful and patient, and I do value both of your roles as Homework Helpers at this website.