What is the relationship between sample size and statistical relevance?

Andre · Nov 30, 2011

I'm helping a friend write a study, and we stumbled upon the question of statistical relevance. Having to investigate the correlation of a certain data set with a population n=67 with many other data sets, we wondered at what correlation value one can speak of statistical relevance. So we ducked into Wikipedia but found no simple, straightforward answer; instead we soon found ourselves linking deeper and deeper into a maze of math.

However, having juggled with Excel a lot, it was pretty simple to run umpty (say 5000) correlations of Gaussian-distributed random number sets (the null hypothesis) and get the standard deviation σ. This turned out to be 12.3%, so I guess it is safe to say that if you want a statistical relevance of 95% certainty of some real correlation with 67 data pairs, the correlation needs to be greater than 3σ, or 3*12.3=37%. Right?
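For readers without the spreadsheet, the experiment described above can be sketched in a few lines of Python. This is a minimal sketch and not the original Excel workbook: the function names, the seed, and the use of Python's built-in `random.gauss` generator are all assumptions made for illustration. It also prints the empirical 95th percentile of |r|, which can be compared directly with the 2σ and 3σ thresholds.

```python
import math
import random
import statistics

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def null_correlations(n_pairs, n_trials, rng):
    """Correlate n_trials pairs of independent Gaussian samples (null hypothesis)."""
    rs = []
    for _ in range(n_trials):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n_pairs)]
        ys = [rng.gauss(0.0, 1.0) for _ in range(n_pairs)]
        rs.append(pearson_r(xs, ys))
    return rs

rng = random.Random(12345)  # arbitrary fixed seed so the run is repeatable
rs = null_correlations(n_pairs=67, n_trials=5000, rng=rng)

sigma = statistics.pstdev(rs)                        # spread of r under the null
abs_sorted = sorted(abs(r) for r in rs)
q95 = abs_sorted[int(0.95 * len(abs_sorted))]        # empirical 95th percentile of |r|

print(f"sigma of r, n=67:        {sigma:.4f}")
print(f"95th percentile of |r|:  {q95:.4f}")
print(f"2*sigma = {2 * sigma:.4f}, 3*sigma = {3 * sigma:.4f}")
```

If r were exactly normal, 95% of the null values would fall within about 2σ and 3σ would correspond to roughly 99.7%, so the printed percentile is a quick check on which multiple of σ matches the intended certainty.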

So I got curious what variation in sample size would do to this ρ for the null hypothesis, so I ran the same spreadsheet for a couple of larger and smaller sample sizes, plotted them and found this:

So that trendline is unbelievably close to simply σ=1/√n

Of course such simple relations usually have been found already a century or two ago, so my question is to whom we should refer if we use this formula in the study to justify what we consider statistical relevance.

Stephen Tashi · Nov 30, 2011

Andre said:

we soon found ourselves linking deeper and deeper into a maze of math.

The most relevant maze of math that I found is: http://mathworld.wolfram.com/CorrelationCoefficientBivariateNormalDistribution.html

so I guess it is safe to say that if you want a statistical relevance of 95% certainty of some real correlation with 67 data pairs, the correlation needs to be greater than 3σ, or 3*12.3=37%. Right?

Are you assuming the correlation coefficient of a sample of 67 pairs is normally distributed? The result I found says that there is a function of the sample correlation coefficient that has a t-distribution.
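For reference, the standard form of that result, assuming the pairs are drawn from a bivariate normal distribution with zero true correlation, is written out below. Here n is the number of pairs and t_crit is the critical value of Student's t with n-2 degrees of freedom at the chosen significance level; the symbol names are chosen for this note and are not taken from the linked page.

```latex
t = r\,\sqrt{\frac{n-2}{1-r^{2}}} \sim t_{n-2}
\qquad\Longleftrightarrow\qquad
r_{\text{crit}} = \frac{t_{\text{crit}}}{\sqrt{t_{\text{crit}}^{2} + n - 2}}
```

As a rough worked example, for n = 67 and a two-sided 5% level the tabulated t value for 65 degrees of freedom is close to 2.0, which gives r_crit ≈ 2.0/√(4.0 + 65) ≈ 0.24.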

Edit: Here's a page that mentions using a normal approximation for that function:
http://demonstrations.wolfram.com/NullDistributionOfTheCorrelationCoefficient/

So I got curious what variation in sample size would do to this ρ for the null hypothesis, so I ran the same spreadsheet for a couple of larger and smaller sample sizes, plotted them and found this:

You should define how you got the variables that were plotted. Did you assume the sample correlation was normally distributed in the calculation?

Andre · Dec 1, 2011

Thank you so much for your interest. Yes, I really wondered whether my failure to find this relationship (σ=1/√n) anywhere could be attributed to an error I made, or maybe to my limited ability to explain. So I'd really appreciate getting to the bottom of this.

You can dig out my basic spreadsheet from my dropbox http://dl.dropbox.com/u/22026080/monte-carlo-en.xlsx .

The Gaussian-distributed random numbers are generated in the tab "casino". It is based on 'Skip' Carter's algorithm ("polar form of the Box-Muller transformation"), which I used earlier to play with random walks and the Hurst exponent.

I had no idea how to emulate the loop, so I prepared 5 sets of numbers in columns D-K to generate the value "w" in column L from the first set; if w>1, a new w is generated from the second set in column L, etc. After 5 iterations like that, the failure rate turned out to be acceptably low IMO; if it still fails, we use 0 for practical purposes, introducing a marginally small error. The resulting number pairs are in columns A and B.
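In a language with loops, the polar form of the Box-Muller transformation needs no fixed number of fallback sets. The following is a minimal sketch of that method in Python, not a transcription of the spreadsheet; the function name and the seed are assumptions made for illustration.

```python
import math
import random

def polar_gaussian_pair(rng):
    """Return two standard normal deviates using the polar Box-Muller method."""
    while True:
        # Draw a point uniformly in the square [-1, 1) x [-1, 1).
        x1 = 2.0 * rng.random() - 1.0
        x2 = 2.0 * rng.random() - 1.0
        w = x1 * x1 + x2 * x2
        # Accept only points strictly inside the unit circle (and not the origin);
        # otherwise loop and draw again.
        if 0.0 < w < 1.0:
            break
    scale = math.sqrt(-2.0 * math.log(w) / w)
    return x1 * scale, x2 * scale

rng = random.Random(2011)  # arbitrary fixed seed for repeatability
pairs = [polar_gaussian_pair(rng) for _ in range(20000)]

# Sanity check on the first member of each pair: mean near 0, variance near 1.
firsts = [a for a, _ in pairs]
mean = sum(firsts) / len(firsts)
var = sum((a - mean) ** 2 for a in firsts) / len(firsts)
print(f"mean={mean:.3f} variance={var:.3f}")
```

A single draw is rejected with probability 1 - π/4 ≈ 21.5%, so five rejections in a row happen roughly 0.05% of the time, which is consistent with the low failure rate seen after 5 iterations in the spreadsheet.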

Evidently, using those pairs to correlate would introduce a bias because of the test in the algorithm, so I displaced the B column by one row to make sure that the pairs are unrelated.

Then, in the Roulette-67 tab, we find the generation of 5000 correlation calculations, giving the standard deviation in cell E2.

The funny formulae in the E column greatly facilitate the generation of the right intervals in the correlation calculations in column C for any population N, as given in cell A2 (67 in this case). Just paste the values of the generated strings from the E column into the C column and activate them as calculations with an <enter> in every cell.

Note that the argument separator is a semicolon ; in my Dutch Excel version, since the comma , is the decimal separator in Dutch, so I changed it in the formula (indicated >>>,<<<) to make it work (hopefully) for the English version:

=CONCATENATE("=correl(casino!a";TEXT(A4+1;"#####");":casino!a";TEXT(A5;"#####");">>>,<<<casino!b";TEXT(A4+1;"#####");":casino!b";TEXT(A5;"#####");")")

In my working version I used a little macro to pick up the generated values in the B column, paste them in the F column, and then insert empty cells to stack up the calculations to the right. After repeating that macro 100 times, you get 5000 correlation calculations.

So I repeated that for populations of 10, 41 (used elsewhere), 100, and 500 to get that trendline.
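The same sweep over sample sizes can be sketched without a macro. This is an illustrative Python version and not the spreadsheet procedure: the helper names, the seed, and the reduced count of 2000 trials per sample size (chosen to keep a pure-Python run short) are assumptions. It prints the simulated σ next to both candidate formulas.

```python
import math
import random
import statistics

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def null_sigma(n_pairs, n_trials, rng):
    """Standard deviation of r over n_trials uncorrelated Gaussian samples."""
    rs = []
    for _ in range(n_trials):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n_pairs)]
        ys = [rng.gauss(0.0, 1.0) for _ in range(n_pairs)]
        rs.append(pearson_r(xs, ys))
    return statistics.pstdev(rs)

rng = random.Random(67)  # arbitrary fixed seed for repeatability
results = {}
print("    n  simulated  1/sqrt(n)  1/sqrt(n-1)")
for n in (10, 41, 67, 100, 500):
    results[n] = null_sigma(n, 2000, rng)
    print(f"{n:5d}  {results[n]:9.4f}  {1 / math.sqrt(n):9.4f}  {1 / math.sqrt(n - 1):11.4f}")
```

With only a few thousand trials the simulated values carry a sampling error of a percent or two, so the two formulas are hard to tell apart except at the smallest sample sizes.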

Obviously this macro is not included, to prevent virus alerts going berserk.

I hope this helps.

Stephen Tashi · Dec 1, 2011

I can't help with the spreadsheet math because I don't use Excel - I don't even use Windows.

The page:
http://demonstrations.wolfram.com/NullDistributionOfTheCorrelationCoefficient/
shows the distribution of the sample correlation coefficient [itex] r [/itex] for [itex] N [/itex] pairs of numbers, all drawn from the same normal distribution.

The page says that a "simple but reasonable approximation" for the distribution of [itex] r [/itex] is a normal distribution with mean zero and variance [itex] \frac{1}{N-1} [/itex].

That may explain your conjecture that your graph implies [itex] \sigma = \frac{1}{\sqrt{N}} [/itex]. The correct approximation, according to that page, would be [itex] \sigma = \frac{1}{\sqrt{N-1}} [/itex].
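To see how close the two expressions are, the short Python sketch below evaluates both for the sample sizes mentioned in this thread. The variable names are chosen for illustration only.

```python
import math

n = 67
sigma_n = 1.0 / math.sqrt(n)              # conjectured trendline, 1/sqrt(N)
sigma_n_minus_1 = 1.0 / math.sqrt(n - 1)  # approximation from the linked page

print(f"N={n}: 1/sqrt(N) = {sigma_n:.4f}, 1/sqrt(N-1) = {sigma_n_minus_1:.4f}")

# Relative gap between the two formulas for each sample size used in the thread.
gaps = {}
for size in (10, 41, 67, 100, 500):
    a = 1.0 / math.sqrt(size)
    b = 1.0 / math.sqrt(size - 1)
    gaps[size] = (b - a) / a
    print(f"N={size:4d}: relative difference = {gaps[size]:.2%}")
```

For N = 67 the second expression rounds to 12.3%, matching the simulated standard deviation reported earlier in the thread, and the gap between the two formulas shrinks as N grows, which would make them hard to separate on a fitted trendline.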

Andre · Dec 1, 2011

It looks like you're right and that link gives me the reference I was looking for. Thanks.
