Chi-squared test for normality

In summary: The thread asks whether the Chi-squared value changes when the data points are separated into different groups. The Chi-squared value for normality is affected by how the data points are grouped.
  • #1
Joon

Homework Statement


Hello, I was given 2 sets of data, showing 20 temperature values and 35 temperature values respectively. The data sets look like below:

Data 1 (Temperature)    Data 2 (Temperature)
30.9                    28.5
30.6                    30.4
..                      ..
(20 values in total)    (35 values in total)

I have done the Chi-squared test for these two sets of data. However, I am also required to analyse the third case, where no data is provided. It was only given that the third dataset consists of 66 data points (and within the limits of any random effects matches the characteristics of datasets 1 and 2) and I need to suggest the number of Chi-squared groups for the third dataset.

I want to ask what the best way is to determine the number of Chi-squared groups for a specific number of data points (in this case 66).

Homework Equations

The Attempt at a Solution



For the first and second sets of data, I simply split the values into groups of 5 points, giving 4 groups for the first set and 7 for the second.
 
  • #2
The Chi-squared tests for normality I know use the 68-95-99.7% rule: You compute the sample data mean, SE and then you expect 68% of the data points to be within 1 SE from the mean, etc. Is that the one you have in mind?

Also, I am not sure I understood. Did you conduct a Chi-squared test on each data set, checking for normality? Or are you comparing the two data sets to test whether they come from the same distribution? If the latter, a Wilcoxon rank-sum test may work.
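
The 68-95-99.7 approach described in this post can be sketched in a few lines. This is a minimal sketch, assuming bins bounded at 1, 2 and 3 standard deviations either side of the mean; the function name `expected_counts` and the choice of bin edges are illustrative assumptions, not something given in the problem statement.

```python
from statistics import NormalDist

def expected_counts(n, edges_in_sd=(-3, -2, -1, 0, 1, 2, 3)):
    """Expected number of points per bin for n draws from a normal distribution.

    Bins are (-inf, e0], (e0, e1], ..., (ek, +inf), with the edges given in
    units of standard deviations from the mean (an illustrative choice).
    """
    z = NormalDist()  # standard normal
    cdf = [0.0] + [z.cdf(e) for e in edges_in_sd] + [1.0]
    return [n * (hi - lo) for lo, hi in zip(cdf, cdf[1:])]

# Fraction of a normal population within 1, 2 and 3 SDs of the mean
within = [NormalDist().cdf(k) - NormalDist().cdf(-k) for k in (1, 2, 3)]
print([round(w, 4) for w in within])

# Expected counts per bin for a sample of 20 points
print([round(c, 2) for c in expected_counts(20)])
```

With only 20 points the expected counts in the outer bins come out well below 1, which is one reason the tail bins are usually merged for small samples.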
 
  • #3
Thanks for your reply, do you mean standard deviation by SE?
 
  • #4
Joon said:
Thanks for your reply, do you mean standard deviation by SE?
Sure. But, do you have the population SD? If not, you need to use SE from the sample data.
 
  • #5
It was only given that the third dataset has 66 data points, nothing more. In this case, should I use the mean and SD taken from Dataset 1 or 2?

I am required to suggest a possible scheme, suggest how the 66 data points could be separated into n groups.
 
  • #6
I would think to do the same type/level of binning as in the other cases, i.e., the same number of categories. There isn't really much going on in normal distributions beyond 4sigma from the mean. If your data is equal up to random error, then it seems you would use the same number of bins/categories.

EDIT: Are the sample mean, SD roughly the same for data sets 1,2?
EDIT2: It ultimately comes down to comparing the observed frequencies, using the Chi^2 statistic, to the expected frequencies. If the data come from the same population, you can assume the same mean (except maybe if your SE is extremely small) and SE for the third group.

It seems a bit confusing to me: did you reject/accept the claim from datasets 1,2? What is the purpose then for the 3rd data set? Just trying to understand better what you are aiming/testing for.
 
  • #7
I initially separated data 1 with 20 data points into 4 groups, 5 data points in each group. 7 groups for data 2 with 35 data points.
I want to ask a question:
Is the Chi-squared value of a dataset affected by how I separate the data points? For instance, I could separate 20 data points into 4 groups, but with 4, 6, 4, 6 points in the groups. Or does the Chi-squared value end up the same irrespective of the number of groups and how many data points are in each group?

Could you explain a bit more about the mean ± 4 SD?
Also, by the same number of bins/categories, do you mean dataset 3 with 66 data points could just be separated into groups of 5, i.e. 66/5 (about 13 groups)?
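
The question of whether the grouping changes the Chi-squared value can be explored numerically. The sketch below is only an illustration under stated assumptions: the 20 data points are synthetic (drawn from a normal with mean 50 and SD 10, not the coursework data), the bin edges are arbitrary, and `chi2_stat` is a hypothetical helper name.

```python
import random
from statistics import NormalDist, mean, stdev

def chi2_stat(data, edges):
    """Pearson chi-squared statistic against a normal fitted to the data.

    `edges` are the interior bin boundaries; the outer bins extend to
    -inf and +inf so that the expected counts sum to len(data).
    """
    fitted = NormalDist(mean(data), stdev(data))
    bounds = [float("-inf")] + sorted(edges) + [float("inf")]
    total = 0.0
    for lo, hi in zip(bounds, bounds[1:]):
        observed = sum(lo < x <= hi for x in data)
        p_lo = 0.0 if lo == float("-inf") else fitted.cdf(lo)
        p_hi = 1.0 if hi == float("inf") else fitted.cdf(hi)
        expected = len(data) * (p_hi - p_lo)
        total += (observed - expected) ** 2 / expected
    return total

random.seed(0)  # synthetic data, for illustration only
data = [random.gauss(50, 10) for _ in range(20)]

chi2_a = chi2_stat(data, [43, 50, 57])          # 4 groups
chi2_b = chi2_stat(data, [40, 45, 50, 55, 60])  # 6 groups
print(chi2_a, chi2_b)
```

The same data are scored against the same fitted normal in both calls, so any difference between the two printed values comes from the grouping alone.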
 
  • #8
EDIT: Good questions, please give me some more time to think it through. My point about the ##\pm 4## SDs relates to the 68-95-99.7 rule: assuming normality, 99.7% of your data will lie within 3 SDs of the mean, and essentially all of it within 4 SDs. But I will check the remainder. It is also an issue of how well and how fast the data converge to a normal. You expect that the larger the data set (assuming normality), the closer you get to an actual normal: 35 points should look closer to normal than 20, and 66 closer than 35. My thought is to use the SD from data set 2, divide by ##\sqrt{66}## instead of ##\sqrt{35}##, and then see how many SEs from that last data set cover the entire range of the data.
 
  • #9
For EDIT: For the first dataset, mean is 50.10 and SD is 9.77. For the second dataset, mean is 49.89 and SD is 9.98. They are similar to some extent but I'm not sure if it is okay to take the mean and SD values from one of these datasets for data 3.
EDIT 2: I understand what you mean. Thanks for the explanation.

This is for my statistics coursework: without being given the actual dataset for data 3, I'm required to divide 66 data points into n groups. Below are the questions that were placed right below the data 3 question, which I also need to answer. Do you think these make data 3 more useful (to test a student's knowledge of this topic)?

Suppose that for a given grouping arrangement the sum of Chi-Squared values obtained was 0.1 (arbitrary units here) what would the confidence level be? Also, there is often some sensitivity to the grouping arrangements and if 2 other grouping arrangements produced Chi-Squared sums of 0.2 and 0.4 (arb. units) what would be the confidence levels for these cases?
To be honest, I have no idea what the question requires. The confidence level can simply be looked up in a Chi-squared distribution table if I know the degrees of freedom. What do you think is the point of the question?
 
  • #10
I calculated SD using sqrt(66) instead of sqrt(35):
SD from dataset 2 is 9.98 so 9.98 * sqrt(35) / sqrt(66) = 7.27.
Dataset 2 has values from 28.4 up to 70.6 and therefore 70.6 / 7.27 gives 9.71.
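
The arithmetic in this post can be checked with a few lines. This is a minimal sketch that only reproduces the numbers quoted above; it does not take a position on whether rescaling the SD by the sample size is the right step, and the variable names are illustrative.

```python
import math

sd_2 = 9.98          # sample SD of dataset 2, as quoted in the thread
n_2, n_3 = 35, 66    # sizes of dataset 2 and dataset 3

# Rescale by sqrt(n_2) / sqrt(n_3), as done in the post
scaled = sd_2 * math.sqrt(n_2) / math.sqrt(n_3)
print(round(scaled, 2))           # 7.27

# Largest value in dataset 2 divided by the rescaled figure
print(round(70.6 / scaled, 2))    # 9.71

# For comparison: standard error of the mean of dataset 2, SD / sqrt(n)
sem_2 = sd_2 / math.sqrt(n_2)
print(round(sem_2, 3))
```

Note that a sample SD estimates the spread of the population and is not expected to shrink as the sample grows; the quantity that scales as ##1/\sqrt{n}## is the standard error of the mean.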
 
  • #11
Joon said:
I calculated SD using sqrt(66) instead of sqrt(35):
SD from dataset 2 is 9.98 so 9.98 * sqrt(35) / sqrt(66) = 7.27.
Dataset 2 has values from 28.4 up to 70.6 and therefore 70.6 / 7.27 gives 9.71.
Ok, so my idea is that you use 4 SEs unless you have enough outlier values beyond that. So you would go 4 SEs from the sample mean in either direction unless you find enough outliers, though that is not likely under the assumption that data set 3 is similar to data sets 1 and 2. But let me mull it over some more.
 
  • #12
Sorry, I've just found this figure from my lecture note.
I suppose for data 3 it is not to do with any calculations and using mean and SD from either Data 1 or 2, it's just using this graph?
Data 3 with 66 data points will have 10 groups according to the graph.

How would I determine the number of members in each group though? Do you have any idea?
 

Attachment: stats.png
  • #13
@Joon It is hard to give useful advice without fully understanding the problem. A myriad of questions arise, e.g., are we to assume that the two samples come from the same population, and that the third sample will, too?

Could you please supply the full statement of the problem, showing what we are told to assume and what we are asked to test for?
 
  • #14
Dataset 1,2 and 3 are all repeated data from the same sample. No actual data was given for dataset 3, just the number of datapoints: 66.
I have done the Chi squared analysis for Dataset 1 and 2. I just need to suggest a possible scheme for the number of groups and number of datapoints in each group for dataset 3. I've found the graph above, data with 66 data points should be split into 10 groups according to the graph.
66 / 10 = 6.6, I'm trying to figure out how many datapoints I should have in each group.
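
One possible scheme for the split can be sketched as follows. Two options are shown, and both are assumptions rather than something the coursework prescribes: integer group sizes that differ by at most one, and equal-probability bins under a fitted normal, where each of the 10 bins has an expected count of 6.6. The mean and SD of dataset 2 are used purely as placeholders, and the function names are hypothetical.

```python
from statistics import NormalDist

def split_counts(n_points, n_groups):
    """Integer group sizes that differ by at most one and sum to n_points."""
    base, extra = divmod(n_points, n_groups)
    return [base + 1] * extra + [base] * (n_groups - extra)

def equal_probability_edges(mu, sigma, n_groups):
    """Interior bin edges giving each bin probability 1/n_groups under N(mu, sigma)."""
    dist = NormalDist(mu, sigma)
    return [dist.inv_cdf(k / n_groups) for k in range(1, n_groups)]

sizes = split_counts(66, 10)
print(sizes)  # [7, 7, 7, 7, 7, 7, 6, 6, 6, 6]

# Mean and SD of dataset 2 from the thread, used here only as placeholders
edges = equal_probability_edges(49.89, 9.98, 10)
print([round(e, 1) for e in edges])

# Expected count per equal-probability bin
print(66 / 10)  # 6.6
```

With equal-probability bins the expected counts are all the same and all above the usual rule-of-thumb minimum of 5, while the observed counts are whatever the data turn out to give.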
 
  • #15
Suppose that for a given grouping arrangement the sum of Chi-Squared values obtained was 0.1 (arbitrary units here) what would the confidence level be? Also, there is often some sensitivity to the grouping arrangements and if 2 other grouping arrangements produced Chi-Squared sums of 0.2 and 0.4 (arb. units) what would be the confidence levels for these cases?

For the bit above, I think it's just asking for basic knowledge. Degrees of freedom = number of groups - 1 - number of parameters estimated from the data (mean and SD) = 10 - 1 - 2 = 7.
The values of Chi-squared sums are given, so confidence levels can simply be calculated.
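
The table lookup described here can also be done numerically. The sketch below is a minimal, standard-library-only implementation of the Chi-squared cumulative distribution using the power series for the regularized lower incomplete gamma function; `chi2_cdf` is a hypothetical helper name, and the 7 degrees of freedom are the value worked out in this post, not something confirmed by the coursework.

```python
import math

def chi2_cdf(x, dof):
    """P(X <= x) for a chi-squared variable with `dof` degrees of freedom.

    Uses the power series for the regularized lower incomplete gamma
    function, which converges quickly for the small x used here.
    """
    if x <= 0:
        return 0.0
    a, half_x = dof / 2.0, x / 2.0
    term = total = 1.0
    n = 0
    while term > 1e-16 * total:
        n += 1
        term *= half_x / (a + n)
        total += term
    log_prefactor = a * math.log(half_x) - half_x - math.lgamma(a + 1.0)
    return math.exp(log_prefactor) * total

dof = 10 - 1 - 2  # groups - 1 - fitted parameters (mean and SD), as in the post
for chi2_sum in (0.1, 0.2, 0.4):
    lower = chi2_cdf(chi2_sum, dof)
    print(f"chi2 = {chi2_sum}: P(X <= chi2) = {lower:.2e}, P(X > chi2) = {1 - lower:.6f}")
```

All three sums lie far below the mean of a Chi-squared distribution with 7 degrees of freedom (which is 7), so the upper-tail probabilities come out very close to 1 in each case and differ only slightly between the three grouping arrangements.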
 

FAQ: Chi-squared test for normality

1. What is a Chi-squared test for normality?

The Chi-squared test for normality is a statistical test used to determine whether a given data set follows a normal distribution. It is used to assess the assumption of normality in a data set, which is important for many statistical analyses.

2. How does the Chi-squared test for normality work?

The Chi-squared test for normality works by comparing the observed frequencies of the data to the expected frequencies under a normal distribution. It calculates a test statistic, which is then compared to a critical value from a Chi-squared distribution. If the test statistic is greater than the critical value, the data is considered to deviate significantly from a normal distribution.
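
In symbols, the comparison described above can be written as follows. The notation is introduced here for illustration: ##O_i## and ##E_i## are the observed and expected counts in group ##i## of ##k## groups, ##n## is the sample size, group ##i## covers the interval ##(a_i, b_i]##, ##\Phi## is the standard normal cumulative distribution function, and ##\mu##, ##\sigma## are the mean and SD of the normal being tested.

$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}, \qquad E_i = n\left[\Phi\!\left(\frac{b_i - \mu}{\sigma}\right) - \Phi\!\left(\frac{a_i - \mu}{\sigma}\right)\right]$$

The statistic is then compared with a Chi-squared distribution with ##k - 1 - p## degrees of freedom, where ##p## is the number of parameters estimated from the data (##p = 2## when both the mean and the SD are fitted).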

3. When should I use a Chi-squared test for normality?

A Chi-squared test for normality should be used when you are analyzing a data set and need to determine whether it follows a normal distribution. This is important for many statistical tests, as they assume that the data is normally distributed. If the data does not follow a normal distribution, alternative tests may need to be used.

4. How do I interpret the results of a Chi-squared test for normality?

The results of a Chi-squared test for normality will include a test statistic and a p-value. If the p-value is less than the chosen significance level (typically 0.05), then the null hypothesis (that the data follows a normal distribution) can be rejected, meaning that the data deviates significantly from a normal distribution. If the p-value is greater than the significance level, then the null hypothesis cannot be rejected. This does not prove that the data is normally distributed; it only means that the data is consistent with a normal distribution at that significance level.

5. Are there any limitations to the Chi-squared test for normality?

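Yes, there are some limitations to the Chi-squared test for normality. The result depends on how the data are grouped into bins, and the Chi-squared approximation becomes unreliable when the expected counts per bin are small (a common rule of thumb is at least 5 per bin). It is sensitive to sample size, meaning that larger sample sizes may lead to a significant result even if the deviation from normality is small. It can also be affected by outliers in the data. Additionally, the test assumes that the data is continuous and independent, so it may not be appropriate for all types of data.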
