What is the relationship between sample size and statistical relevance?

  • Thread starter Andre
In summary, the conversation discusses the concept of statistical relevance and how it relates to correlation values. The participants also share their methods of investigating this concept, including using spreadsheets and generating random numbers. They also discuss the distribution of the sample correlation coefficient and its relationship to sample size. The conversation concludes with a reference to a relevant webpage for further information.
  • #1
Andre
I'm helping a friend write a study, and we stumbled upon the question of statistical relevance. We have to investigate the correlation of a certain data set (n=67) with many other data sets, and we wondered at what correlation value one can speak of statistical relevance. So we dove into Wikipedia but found no simple, straightforward answer; instead we soon found ourselves linking deeper and deeper into a maze of math.

However, having juggled with Excel a lot, it was pretty simple to run umpteen (say 5000) correlations of Gaussian-distributed random number sets (the null hypothesis) and get the standard deviation σ. This turned out to be 12.3%, so I guess it is safe to say that if you want a statistical relevance of 95% certainty of some real correlation with 67 data pairs, the correlation needs to be greater than 3σ, or 3*12.3=37%. Right?
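The spreadsheet experiment described above can be sketched in a few lines of Python. This is only an illustration, not the original workbook: the function names `pearson_r` and `null_correlation_sd` are made up here, and the generator is Python's own `random.gauss` rather than the spreadsheet's. One point worth checking against the numbers: if the null distribution of r is treated as roughly normal, a two-sided 95% level corresponds to about 1.96σ, while 3σ corresponds to roughly 99.7%.

```python
import math
import random
import statistics


def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def null_correlation_sd(n, trials=5000, seed=0):
    """Standard deviation of r over many pairs of unrelated Gaussian samples."""
    rng = random.Random(seed)
    rs = []
    for _ in range(trials):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
        ys = [rng.gauss(0.0, 1.0) for _ in range(n)]
        rs.append(pearson_r(xs, ys))
    return statistics.stdev(rs)


sd = null_correlation_sd(67)
print(f"sd of r for n=67:            {sd:.3f}")
print(f"1.96 sd (about 95%, 2-sided): {1.96 * sd:.3f}")
print(f"3 sd (about 99.7%, 2-sided):  {3 * sd:.3f}")
```

With 5000 trials the simulated σ should land close to the 12.3% quoted above, though the exact digits depend on the seed.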

So I got curious what variation in sample size would do to this σ for the null hypothesis, so I ran the same spreadsheet for a couple of larger and smaller sample sizes, plotted them and found this:

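[Image 2h5n1xh.jpg: plot of the null-hypothesis standard deviation σ against sample size n, with a fitted trendline]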


So that trendline is unbelievably close to simply σ=1/√n
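A rough way to reproduce that comparison outside a spreadsheet is sketched below. It assumes the same sample sizes mentioned later in this thread (10, 41, 67, 100, 500), uses fewer trials than the spreadsheet to keep the run short, and the helper names are hypothetical.

```python
import math
import random
import statistics


def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def null_correlation_sd(n, trials, rng):
    """Standard deviation of r between unrelated Gaussian samples of size n."""
    rs = []
    for _ in range(trials):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
        ys = [rng.gauss(0.0, 1.0) for _ in range(n)]
        rs.append(pearson_r(xs, ys))
    return statistics.stdev(rs)


rng = random.Random(1)
results = {}
for n in (10, 41, 67, 100, 500):
    results[n] = null_correlation_sd(n, trials=2000, rng=rng)
    print(f"n={n:4d}  simulated sd={results[n]:.4f}  1/sqrt(n)={1 / math.sqrt(n):.4f}")
```

The two columns should track each other closely for the larger n; any gap is most visible at n=10.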

Of course such simple relations usually have been found already a century or two ago, so my question is to whom we should refer if we use this formula in the study to justify what we consider statistical relevance.
 
  • #2
Andre said:
we soon found ourselves linking deeper and deeper into a maze of math.

The most relevant maze of math that I found is: http://mathworld.wolfram.com/CorrelationCoefficientBivariateNormalDistribution.html


so I guess it is safe to say that if you want a statistical relevance of 95% certainty of some real correlation with 67 data pairs, the correlation needs to be greater than 3σ, or 3*12.3=37%. Right?

Are you assuming the correlation coefficient of a sample of 67 pairs is normally distributed? The result I found says that there is a function of the sample correlation coefficient that has a t-distribution.
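For reference, that function is usually written as follows, assuming the pairs come from a bivariate normal distribution whose true correlation is zero:

[itex] t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} [/itex]

which follows Student's t-distribution with [itex] n-2 [/itex] degrees of freedom. Solving for r gives the critical correlation [itex] r_{crit} = \frac{t_{crit}}{\sqrt{t_{crit}^2 + n - 2}} [/itex]. For [itex] n = 67 [/itex] and a two-sided 95% level, [itex] t_{crit} \approx 2.00 [/itex], so [itex] r_{crit} \approx 0.24 [/itex].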

Edit: Here's a page that mentions using a normal approximation for that function:
http://demonstrations.wolfram.com/NullDistributionOfTheCorrelationCoefficient/

So I got curious what variation in sample size would do to this σ for the null hypothesis, so I ran the same spreadsheet for a couple of larger and smaller sample sizes, plotted them and found this:

You should define how you got the variables that were plotted. Did you assume the sample correlation was normally distributed in the calculation?
 
  • #3
Thank you so much for your interest. Yes, I really wondered whether my failure to find this relationship (σ=1/√n) anywhere could be attributed to an error I made, or maybe to my limited ability to explain. So I'd really appreciate getting to the bottom of this.

You can dig out my basic spreadsheet from my dropbox http://dl.dropbox.com/u/22026080/monte-carlo-en.xlsx .

The Gaussian-distributed random numbers are generated in the tab "casino". It is based on 'Skip' Carter's algorithm ("polar form of the Box-Muller transformation"), which I used earlier to play with random walks and the Hurst exponent.

I had no idea how to emulate the loop, so I prepared 5 sets of numbers in columns D-K to generate the value "w" in column L from the first set; if w>1, then a new w is generated from the second set in column L, etc., etc. After 5 iterations like that, it turned out that the failure rate was acceptably low in my opinion; if it still fails, we use 0 for practical purposes, which introduces a marginally small error. The resulting number pairs are in columns A and B.
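In a language with loops, the rejection step that the five spreadsheet column sets emulate can be written directly. The following is a minimal sketch of the polar form of the Box-Muller transformation as it is commonly described; it is not taken from the spreadsheet, and the function name `polar_gauss_pair` is made up for this example.

```python
import math
import random


def polar_gauss_pair(rng):
    """Return two standard normal deviates using the polar Box-Muller method."""
    while True:
        # Draw a point uniformly in the square [-1, 1) x [-1, 1).
        u = 2.0 * rng.random() - 1.0
        v = 2.0 * rng.random() - 1.0
        w = u * u + v * v
        # Keep only points strictly inside the unit circle (and not at the origin).
        if 0.0 < w < 1.0:
            break
    factor = math.sqrt(-2.0 * math.log(w) / w)
    return u * factor, v * factor


rng = random.Random(42)
samples = []
for _ in range(10000):
    samples.extend(polar_gauss_pair(rng))

mean = sum(samples) / len(samples)
sd = math.sqrt(sum((s - mean) ** 2 for s in samples) / len(samples))
print(f"mean={mean:.3f}  sd={sd:.3f}")
```

Because the loop simply retries until a point is accepted, there is no need for a fallback value of 0 after a fixed number of attempts.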

Evidently, using those pairs to correlate would introduce a bias because of the test in the algorithm, so I displaced the B column by one row to make sure that the pairs are unrelated.

Then in the Roulette-67 tab we find the generation of 5000 correlation calculations, giving the standard deviation in cell E2.

The funny formulae in the E column greatly facilitate the generation of the right intervals in the correlation calculations in column C for any population N, as given in cell A2 (67 in this case). Just paste the values of the generated strings from the E column into the C column and activate them as calculations with an <enter> in every cell.

Note that the argument separator is a semicolon ; in my Dutch Excel version, since the comma , is the decimal separator in Dutch, so I changed it in the formula (indicated >>>,<<<) to make it work (hopefully) for the English version:

=CONCATENATE("=correl(casino!a";TEXT(A4+1;"#####");":casino!a";TEXT(A5;"#####");">>>,<<<casino!b";TEXT(A4+1;"#####");":casino!b";TEXT(A5;"#####");")")
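To make clear what that formula produces, here is a small Python sketch that builds the same kind of string. It assumes, as the cell references suggest but the post does not state outright, that column A holds cumulative row offsets (0, N, 2N, ...), so that A4+1 is the first row of a block and A5 is its last row. The function name `build_correl_formula` is hypothetical.

```python
def build_correl_formula(prev_end, block_end, sep=","):
    """Build a CORREL formula string for rows prev_end+1 .. block_end.

    sep is the argument separator: "," for English Excel, ";" for Dutch.
    """
    first = prev_end + 1
    return (
        f"=correl(casino!a{first}:casino!a{block_end}"
        f"{sep}casino!b{first}:casino!b{block_end})"
    )


n = 67  # block size, the value held in cell A2
offsets = [k * n for k in range(4)]  # assumed contents of column A: 0, 67, 134, 201
formulas = [build_correl_formula(a, b) for a, b in zip(offsets, offsets[1:])]
for formula in formulas:
    print(formula)
```

Each string covers one non-overlapping block of N rows, which is why changing the single value of N regenerates all the intervals.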

In my working version I used a little macro to pick up the generated values in the B column, paste them in the F column, and then insert empty cells to stack up the calculations to the right. After repeating that macro 100 times, you get 5000 correlation calculations.

So I repeated that for populations of 10, 41 (used elsewhere), 100, and 500 to get that trendline.

Obviously this macro is not included, to prevent virus alerts going berserk.

I hope this helps.
 
  • #4
I can't help with the spreadsheet math because I don't use Excel - I don't even use Windows.

The page:
http://demonstrations.wolfram.com/NullDistributionOfTheCorrelationCoefficient/
shows the distribution of the sample correlation coefficient [itex] r [/itex] for [itex] N [/itex] pairs of numbers, all drawn from the same normal distribution.

The page says that a "simple but reasonable approximation" for the distribution of [itex] r [/itex] is a normal distribution with mean zero and variance [itex] \frac{1}{N-1} [/itex].

That may explain your conjecture that your graph implies [itex] \sigma = \frac{1}{\sqrt{N}} [/itex]. The correct approximation, according to that page, would be [itex] \sigma = \frac{1}{\sqrt{N-1}} [/itex].
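To see how little the two expressions differ for the sample sizes used in this thread, the arithmetic can be laid out as below. This is a sketch using only the normal approximation quoted from that page; the variable names are made up, and the two-sided 95% quantile comes from Python's `statistics.NormalDist`.

```python
import math
from statistics import NormalDist

# Two-sided 95% quantile of the standard normal distribution (about 1.96).
z95 = NormalDist().inv_cdf(0.975)

for n in (10, 41, 67, 100, 500):
    sd_n = 1.0 / math.sqrt(n)
    sd_n_minus_1 = 1.0 / math.sqrt(n - 1)
    print(f"N={n:4d}  1/sqrt(N)={sd_n:.4f}  1/sqrt(N-1)={sd_n_minus_1:.4f}")

# Approximate correlation needed for 95% (two-sided) with N = 67 pairs.
r_crit_67 = z95 / math.sqrt(67 - 1)
print(f"approximate 95% threshold for N=67: {r_crit_67:.3f}")
```

For N=67 the two candidates are 0.1222 and 0.1231, so the simulated 12.3% is consistent with the N-1 form, and the difference only becomes noticeable for small N.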
 
  • #5
It looks like you're right and that link gives me the reference I was looking for. Thanks.
 

What is statistically relevant?

Statistically relevant (more commonly called statistically significant) refers to a result that would be unlikely to arise from chance alone if the null hypothesis were true. It does not by itself show that the finding is true or important.

How is statistical relevance determined?

Statistical relevance is determined through statistical tests and measures such as p-values and confidence intervals, often reported alongside effect sizes. A p-value is the probability of obtaining results at least as extreme as those observed if only chance were at work; a small p-value is taken as evidence against the chance explanation.

Why is statistical relevance important?

Statistical relevance is important because it allows us to draw conclusions and make decisions based on data and evidence rather than just speculation or intuition. It also helps to avoid making false assumptions or generalizations.

Can something be statistically relevant but not practically relevant?

Yes, something can be statistically relevant but not practically relevant. This means that the results may be statistically significant, but the effect or difference may be too small to have any practical or meaningful impact in the real world.

What factors can affect statistical relevance?

Several factors can affect statistical relevance, such as sample size, variability in the data, and the choice of statistical tests. It is important to consider these factors when interpreting the results of a statistical analysis.
