Testing Randomness in a Set of 200+ Data Points

In summary, the poster has 200 positive integers, 20 per year over ten years, and wants to know (1) how to test whether a given data point is too large to be random and (2) how to test whether a single source is producing abnormally high numbers over the ten years, for use in a blog post. The suggested method is to compute the mean and standard deviation of the 200 points and check them against a typical distribution, such as the Gaussian. The poster is also advised to specify a cutoff for what counts as "abnormally high" before analysing the data; further steps include graphing the data over time and examining individual years for unusual values.
  • #1
qspeechc
Hi everyone.

It's been years since I've done any stats, so I need a bit of help, please. I want to use this in a blog post I'm going to write (not here on PF), so I don't want to give away too many details :p I apologise for my terrible understanding of stats, so please be patient!

Anyway, over ten years I have 20 data points for each year, i.e. 200 in total, which are positive integers. In practice they are never higher than 2000, although conceivably they could be. The assumption is that each number is generated randomly.

1) How do I test if a given data point is too large to be random, given that the other numbers tend to be smaller?

2) A 'source' produces one data point a year; how can I test whether this source is producing abnormally high numbers over the ten years?

Thank you for any help.
 
  • #2
For both cases I would compute the mean and standard deviation of the distribution of your 200 data points.
1) You can check whether they follow a typical distribution (most notably the Gaussian). If they do, anything you would not expect under that distribution might be a real effect. You cannot be sure without a clear model, but this method gives you a good first idea.
2) Compute the source's mean and the expected deviation of that mean (its standard error), and check whether it is compatible with the overall distribution.
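A minimal sketch of both checks in Python, assuming the 200 values are arranged as 10 years × 20 points per year (the data here are placeholders, and treating the first column as "the source" is purely hypothetical):

```python
import numpy as np
from scipy import stats

# Placeholder data: 10 years x 20 positive integers per year
rng = np.random.default_rng(0)
data = rng.integers(1, 2001, size=(10, 20))

values = data.ravel()
mu = values.mean()
sigma = values.std(ddof=1)

# Rough normality check (Shapiro-Wilk); a small p-value argues against a Gaussian
print("Shapiro-Wilk p-value:", stats.shapiro(values).pvalue)

# 1) z-score of a single suspicious point against the pooled distribution
x = 1900  # hypothetical suspicious value
print("z-score of x:", (x - mu) / sigma)

# 2) Compare one source's 10-year mean to the pooled mean via the
#    standard error of a 10-point mean
source = data[:, 0]  # hypothetical source column
z_source = (source.mean() - mu) / (sigma / np.sqrt(len(source)))
print("source mean z-score:", z_source)
```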
 
  • #3
If all the values are integers they can't be "normally distributed", even if they are symmetric. (The normal distribution is just a convenient description of patterns often found in data anyway; no data are truly normal.) But even if they are symmetric, the process outlined above would indicate only that a value is an "outlier" - that doesn't establish it as non-random, it simply identifies it as unusual in size.

As a first step you need to specify what qualifies as "abnormally high". Do you have a specific cutoff for that? If not, you need a rule such as "any value more than 3 standard deviations above the mean" or "more than 1.5×IQR above the third quartile". Once this is done you might (both steps are sketched in code after this list):
* graph the data over time and look to see which years, if any, have unusually large values
* look at a plot of each year's data (a boxplot, say) to check just that group
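Both cutoff rules and the per-year boxplot might look like this, assuming the same 10 × 20 layout as in the earlier sketch (numpy and matplotlib assumed; the data are placeholders):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
data = rng.integers(1, 2001, size=(10, 20))  # placeholder data
values = data.ravel()

# Rule 1: more than 3 standard deviations above the mean
cut_sd = values.mean() + 3 * values.std(ddof=1)

# Rule 2: more than 1.5 x IQR above the third quartile (the boxplot rule)
q1, q3 = np.percentile(values, [25, 75])
cut_iqr = q3 + 1.5 * (q3 - q1)

print("3-SD cutoff:", cut_sd, "| 1.5*IQR cutoff:", cut_iqr)
print("values above the IQR cutoff:", values[values > cut_iqr])

# One box per year (each column of data.T is one year's 20 values),
# to eyeball years with unusually large values
plt.boxplot(data.T)
plt.xlabel("year")
plt.ylabel("value")
plt.show()
```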

But again, you need to begin with a more specific description of what you mean by "abnormally high".
 

What is randomness and why is it important in data analysis?

Randomness refers to the lack of any discernible pattern or predictability in a set of data points. It is important in data analysis because it allows us to make unbiased and accurate conclusions about a population based on a sample of data.

How do you determine if a set of 200+ data points is random?

Several statistical tests can be used to check whether a set of data points is consistent with randomness, including the Chi-Square test, the Kolmogorov-Smirnov test, and the Runs test. These tests assess the distribution of the values and the patterns within the data set.
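For illustration, a rough sketch of these three tests in Python, under the assumption that the values are meant to be uniform on [1, 2000] (scipy and statsmodels assumed; the data are placeholders):

```python
import numpy as np
from scipy import stats
from statsmodels.sandbox.stats.runs import runstest_1samp

rng = np.random.default_rng(2)
data = rng.integers(1, 2001, size=200)  # placeholder data

# Chi-Square: bin the values and compare observed counts to uniform expectations
observed, _ = np.histogram(data, bins=10, range=(1, 2001))
print("chi-square:", stats.chisquare(observed))

# Kolmogorov-Smirnov against a continuous uniform on [1, 2001)
# (approximate here, since the data are discrete integers)
print("KS:", stats.kstest(data, "uniform", args=(1, 2000)))

# Runs test: do values above/below the median alternate as expected under randomness?
print("runs test:", runstest_1samp(data, cutoff="median"))
```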

What factors can affect the randomness of a data set?

Several factors can affect the randomness of a data set, including the sampling method used, the size of the sample, and the characteristics of the population being studied. Human error or bias can also influence it.

What are the limitations of testing randomness in a set of 200+ data points?

One limitation is that statistical tests can only assess the randomness of a sample, not of the entire population. The tests are also affected by the assumptions made about the data and by the chosen significance level. Finally, a data set may appear random yet still contain underlying patterns or relationships that the tests do not capture.

How can the results of testing randomness be interpreted?

The results should be interpreted in the context of the specific data set and the chosen statistical test. If the test indicates that the data set is random, there is no evidence of a significant pattern or relationship within the data. If the test indicates that the data set is not random, further investigation may be needed to determine the source of the patterns or relationships found.
