
Basic Stats

  1. Oct 23, 2014 #1
    Hi everyone.

    It's been years since I've done any stats, so I need a bit of help, please. I want to include it in a blog post I'm going to do (not here on PF), so I don't want to give away too many details :p I apologise for my terrible understanding of stats, please be patient!

    Anyway, over ten years I have 20 data points for each year, i.e. 200 in total, which are positive integers. In practice they are never higher than 2000, although conceivably they could be. The assumption is that each number is generated randomly.

    1) How do I test if a given data point is too large to be random, given that the other numbers tend to be smaller?

    2) A 'source' produces one data point a year, how can I test if this source is producing abnormally high numbers over the ten years?

    Thank you for any help.
  3. Oct 23, 2014 #2



    Staff: Mentor

    For both cases I would compute the mean and standard deviation of the distribution of your 200 data points.
    1) You can check if they follow a typical distribution (most notably the Gaussian distribution). If yes, any value that would be very unlikely under this distribution might be some real effect. You cannot be sure without a clear model, but you can get a good idea with that method.
    2) Compute the mean of the source's ten values and the expected deviation of such a mean (the standard deviation of the 200 points divided by √10, assuming the values are independent), then see if it is compatible with the first distribution.
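
A minimal Python sketch of the two checks described above, assuming the 200 values are roughly Gaussian and independent. The function names `z_score` and `source_mean_z` are made up for illustration, and the source's own ten values are counted in the overall mean and standard deviation, which makes the second check slightly conservative.

```python
import math
import statistics


def z_score(value, data):
    """Number of sample standard deviations that `value` lies above the mean of `data`."""
    mu = statistics.mean(data)
    sigma = statistics.stdev(data)  # sample standard deviation (n - 1 denominator)
    return (value - mu) / sigma


def source_mean_z(source_values, data):
    """Compare the mean of one source's values with the mean of all data.

    The expected deviation of a mean of n independent values is
    sigma / sqrt(n), so the difference is expressed in those units.
    """
    mu = statistics.mean(data)
    sigma = statistics.stdev(data)
    standard_error = sigma / math.sqrt(len(source_values))
    return (statistics.mean(source_values) - mu) / standard_error
```

Under the Gaussian assumption, a result above roughly 3 from either function would be unusual for purely random values. With 200 data points, though, a few values above 2 are expected by chance alone.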
  4. Oct 24, 2014 #3


    Homework Helper

    If all the values are integers they can't be "normally distributed", even if they are symmetric. (The normal distribution is just a convenient description of patterns often found in data anyway; no data is truly normal.) But even if they are symmetric, the process outlined above would indicate only that a value is an "outlier" - that doesn't show the value to be non-random, it simply identifies it as unusual in size.

    As a first step you need to specify what qualifies as "abnormally high". Do you have a specific cutoff for that? If not, then you need a rule such as: any value more than 3 standard deviations above the mean, or more than 1.5 × IQR above the third quartile. Once this is done you might
    * graph the data over time and look to see which years, if any, have unusually large values
    * look at a plot of each year's data (boxplot?) to check just that group

    But again, you need to begin by describing more specifically what you mean.
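
The two cutoff rules mentioned above could be sketched in Python as follows. This is only an illustration: the function names `flag_by_sd` and `flag_by_iqr` are hypothetical, the multipliers 3 and 1.5 are the conventional defaults quoted in the post, and the quartiles use the "inclusive" method of `statistics.quantiles` (Python 3.8+), which is one of several common quartile definitions.

```python
import statistics


def flag_by_sd(data, k=3.0):
    """Return the values lying more than k sample standard deviations above the mean."""
    mu = statistics.mean(data)
    sigma = statistics.stdev(data)
    cutoff = mu + k * sigma
    return [x for x in data if x > cutoff]


def flag_by_iqr(data, k=1.5):
    """Return the values lying more than k * IQR above the third quartile."""
    q1, _median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    cutoff = q3 + k * (q3 - q1)
    return [x for x in data if x > cutoff]
```

The same functions can be applied to all 200 values at once or to each year's 20 values separately. Note that a very large value inflates the standard deviation it is being compared against, so the IQR rule tends to be less affected by the outliers themselves.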