What is the relationship between sample size and statistical relevance?

  • Context: Undergrad 
  • Thread starter: Andre

Discussion Overview

The discussion centers on the relationship between sample size and statistical relevance, particularly in the context of correlation coefficients. Participants explore how varying sample sizes affect the standard deviation of correlation values and the implications for determining statistical significance in research studies.

Discussion Character

  • Exploratory
  • Technical explanation
  • Debate/contested
  • Mathematical reasoning

Main Points Raised

  • One participant suggests that for a sample size of 67, a correlation greater than 37% may indicate statistical relevance at a 95% confidence level, based on their calculations of standard deviation.
  • Another participant questions the assumption that the sample correlation coefficient is normally distributed and references a t-distribution for the correlation coefficient.
  • A participant shares their methodology for generating random numbers and calculating correlations, expressing uncertainty about the validity of their findings and seeking feedback on their approach.
  • One participant points to a resource that approximates the distribution of the sample correlation coefficient as a normal distribution with a specific variance, suggesting this may clarify the earlier conjecture regarding the relationship between sample size and standard deviation.
  • Another participant acknowledges the usefulness of the referenced resource in providing clarity on the topic.

Areas of Agreement / Disagreement

Participants express differing views on the distribution of the sample correlation coefficient and its implications for statistical relevance. There is no consensus on the validity of the initial claims regarding the correlation threshold for statistical significance.

Contextual Notes

Participants note potential limitations in their assumptions about the distribution of correlation coefficients and the implications of sample size on statistical relevance. There is an acknowledgment of the complexity involved in these calculations and the need for careful consideration of the underlying statistical principles.

Who May Find This Useful

This discussion may be of interest to researchers and students involved in statistical analysis, particularly those exploring the implications of sample size on correlation and statistical significance in their studies.

Andre
I'm helping a friend write a study, and we have stumbled upon the question of statistical relevance. We need to investigate the correlation of a certain data set, with a population of n=67, against many other data sets, and we wondered at what correlation value one can speak of statistical relevance. So we dug into Wikipedia but found no simple, straightforward answer; instead we soon found ourselves linking deeper and deeper into a maze of math.

However, having juggled with Excel a lot, it was pretty simple to run umpteen (say 5000) correlations of Gaussian-distributed random number sets (the null hypothesis) and get the standard deviation σ. This turned out to be 12.3%, so I guess it is safe to say that if you want a statistical relevance of 95% certainty of some real correlation with 67 data pairs, the correlation needs to be greater than 3σ, or 3*12.3=37%. Right?

So I got curious what varying the sample size would do to this σ for the null hypothesis, so I ran the same spreadsheet for a couple of larger and smaller sample sizes, plotted the results and found this:

[Image: 2h5n1xh.jpg, plot of σ against sample size n with a fitted trendline]


So that trendline is unbelievably close to simply σ=1/√n.

Of course, such simple relations have usually been found a century or two ago already, so my question is whom we should cite if we use this formula in the study to justify what we consider statistical relevance.
 
Andre said:
we soon found ourselves linking deeper and deeper into a maze of math.

The most relevant maze of math that I found is: http://mathworld.wolfram.com/CorrelationCoefficientBivariateNormalDistribution.html


so I guess it is safe to say that if you want a statistical relevance of 95% certainty of some real correlation with 67 data pairs, the correlation needs to be greater than 3σ, or 3*12.3=37%. Right?

Are you assuming the correlation coefficient of a sample of 67 pairs is normally distributed? The result I found says that there is a function of the sample correlation coefficient that has a t-distribution.
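
For reference, the standard result alluded to here can be written out as follows. It is stated under the assumption that the pairs are drawn from a bivariate normal distribution whose true correlation is zero; the symbols [itex]t[/itex] and [itex]r_{crit}[/itex] are introduced only for this illustration.

[itex]t = r \sqrt{\frac{n-2}{1-r^2}}[/itex]

follows a Student t-distribution with [itex]n-2[/itex] degrees of freedom. Solving for [itex]r[/itex] gives a threshold

[itex]r_{crit} = \frac{t_{crit}}{\sqrt{n-2+t_{crit}^2}}[/itex]

For [itex]n=67[/itex] the two-sided 95% critical value of [itex]t[/itex] with 65 degrees of freedom is close to 2.0, which puts [itex]r_{crit}[/itex] at roughly 0.24.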

Edit: Here's a page that mentions using a normal approximation for that function:
http://demonstrations.wolfram.com/NullDistributionOfTheCorrelationCoefficient/

So I got curious what varying the sample size would do to this σ for the null hypothesis, so I ran the same spreadsheet for a couple of larger and smaller sample sizes, plotted the results and found this:

You should define how you got the variables that were plotted. Did you assume the sample correlation was normally distributed in the calculation?
 
Thank you so much for your interest. Yes, I really wondered whether my failure to find this relationship (σ=1/√n) anywhere could be attributed to an error I made, or maybe to my limited ability to explain. So I'd really appreciate getting to the bottom of this.

You can dig out my basic spreadsheet from my dropbox http://dl.dropbox.com/u/22026080/monte-carlo-en.xlsx .

The Gaussian-distributed random numbers are generated in the tab "casino". The generator is based on 'Skip' Carter's algorithm ("polar form of the Box-Muller transformation"), which I used earlier to play with random walks and the Hurst exponent.

I had no idea how to emulate the loop, so I prepared 5 sets of numbers in columns D-K to generate the value "w" in column L from the first set. If w>1, then a new w is generated from the second set in column L, and so on. After 5 iterations like that, the failure rate turned out to be acceptably low IMO; if it still fails, we use 0 for practical purposes, which introduces a marginally small error. The resulting number pairs are in columns A and B.
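
For readers without Excel, the polar form of the Box-Muller transformation can be sketched in a few lines of Python. This is a minimal illustration rather than a port of the spreadsheet: the function name `polar_gaussian_pair` is made up for this example, and it uses an open-ended rejection loop instead of the five prepared attempts with a fallback of 0. Each attempt is accepted with probability π/4 ≈ 0.785, so five rejections in a row should occur only about 0.05% of the time, which fits the observation that the failure rate after five attempts is low.

```python
import math
import random


def polar_gaussian_pair(rng):
    """Return two independent standard normal values (polar Box-Muller method).

    Hypothetical helper for illustration; `rng` is a random.Random instance.
    """
    while True:
        u = 2.0 * rng.random() - 1.0  # uniform in [-1, 1)
        v = 2.0 * rng.random() - 1.0
        w = u * u + v * v
        # Keep only points strictly inside the unit circle (and not the origin).
        if 0.0 < w < 1.0:
            break
    scale = math.sqrt(-2.0 * math.log(w) / w)
    return u * scale, v * scale


rng = random.Random(42)
for _ in range(3):
    print(polar_gaussian_pair(rng))
```

The two values returned by one call are independent of each other, so in a sketch like this there is no need to shift one column by a row before correlating.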

Evidently, correlating those pairs directly would introduce a bias because of the test in the algorithm, so I displaced the B column by one row to make sure that the pairs are unrelated.

Then, in the Roulette-67 tab, we find the generation of 5000 correlation calculations, giving the standard deviation in cell E2.

The funny formulae in the E column greatly facilitate the generation of the right intervals in the correlation calculations in column C for any population N, as given in cell A2 (67 in this case). Just paste the values of the generated strings from the E column into the C column and activate them as calculations with an <enter> in every cell.

Note that the argument separator is a semicolon ; in my Dutch Excel version, since the comma , is the decimal separator in Dutch, so I changed it in the formula (indicated >>>,<<<) to make it work (hopefully) in the English version:

=CONCATENATE("=correl(casino!a";TEXT(A4+1;"#####");":casino!a";TEXT(A5;"#####");">>>,<<<casino!b";TEXT(A4+1;"#####");":casino!b";TEXT(A5;"#####");")")

In my working version I used a little macro, to pick up the generated values in the b column and paste them in the F column and then insert empty cells, to stack up the calculations to the right. After repeating that macro 100 times, you get 5000 correlation calculations.

So I repeated that for populations of 10, 41 (used elsewhere), 100, and 500 to get that trendline.
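
The procedure described above can also be sketched outside Excel. The following is a minimal Python illustration, not a translation of the spreadsheet: the function names `pearson_r` and `null_correlation_sd` are invented for this example, Python's built-in `random.gauss` stands in for the "casino" tab, and it runs 2000 trials per sample size rather than 5000 purely to keep the pure-Python run time short.

```python
import math
import random


def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def null_correlation_sd(n, trials=5000, seed=0):
    """Standard deviation of r over `trials` sets of n independent Gaussian pairs."""
    rng = random.Random(seed)
    rs = []
    for _ in range(trials):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
        ys = [rng.gauss(0.0, 1.0) for _ in range(n)]
        rs.append(pearson_r(xs, ys))
    mean_r = sum(rs) / trials
    return math.sqrt(sum((r - mean_r) ** 2 for r in rs) / (trials - 1))


# Sample sizes mentioned in the thread; trial count reduced (assumption, see above).
for n in (10, 41, 67, 100, 500):
    sd = null_correlation_sd(n, trials=2000)
    print(f"n={n:4d}  simulated sd={sd:.4f}  1/sqrt(n)={1 / math.sqrt(n):.4f}")
```

With a fixed seed the run is repeatable, and raising `trials` should tighten the scatter of the simulated values around the trendline.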

Obviously this macro is not included, to prevent virus alerts from going berserk.

I hope this helps.
 
I can't help with the spreadsheet math because I don't use Excel - I don't even use Windows.

The page:
http://demonstrations.wolfram.com/NullDistributionOfTheCorrelationCoefficient/
shows the distribution of the sample correlation coefficient [itex]r[/itex] for [itex]N[/itex] pairs of numbers, all drawn from the same normal distribution.

The page says that a "simple but reasonable approximation" for the distribution of [itex]r[/itex] is a normal distribution with mean zero and variance [itex]\frac{1}{N-1}[/itex].

That may explain your conjecture that your graph implies [itex]\sigma = \frac{1}{\sqrt{N}}[/itex]. The correct approximation, according to that page, would be [itex]\sigma = \frac{1}{\sqrt{N-1}}[/itex].
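
To put numbers on this for the 67 pairs in question, here is a small Python sketch that relies only on the normal approximation quoted above. The function names `approx_sigma` and `critical_r` are made up for illustration, and the two-sided 95% threshold is an assumption about what "95% certainty" is meant to express.

```python
import math
from statistics import NormalDist


def approx_sigma(n):
    """Normal-approximation standard deviation of r under the null: 1/sqrt(N-1)."""
    return 1.0 / math.sqrt(n - 1)


def critical_r(n, confidence=0.95):
    """Two-sided threshold on |r| under the normal approximation (assumed test)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return z * approx_sigma(n)


n = 67
print(f"1/sqrt(N)   = {1 / math.sqrt(n):.4f}")
print(f"1/sqrt(N-1) = {approx_sigma(n):.4f}")
print(f"two-sided 95% threshold on |r| = {critical_r(n):.4f}")
```

Under these assumptions the [itex]N-1[/itex] form reproduces the simulated 12.3% for [itex]N=67[/itex], and the 95% threshold comes out near 0.24. A cutoff of [itex]3\sigma[/itex] (about 0.37) is stricter: under a normal distribution it corresponds to roughly 99.7% two-sided.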
 
It looks like you're right and that link gives me the reference I was looking for. Thanks.
 
