
Any statisticians about? Your analysis is needed

  1. Aug 23, 2003 #1

    I'm currently putting the finishing touches on a program that I created in response to something that came up on another forum I frequent.

    There is someone there who claims that they have the ability to influence the results of video games, and the state of stoplights while they drive, through their magical power. They are not joking.

    I have created a little game to test this out. The game occurs in rounds. Each round, a frog is dropped onto a board. The frog hops towards a stoplight. When the frog nears the stoplight, the light randomly turns either green or red (lightState = (rand() % 2 == 0 ? RED_LIGHT : GREEN_LIGHT);). If the light is green, the frog hops past safely and does a little backflip to show just how happy he is. If the light turns red... well, what do you think happens?

    Before each round, the player is asked to select what they think the outcome will be - whether the frog lives, or suffers an untimely demise. The game keeps track of how many times the player guesses correctly or incorrectly, and how many times each stoplight color comes up.

    Afterwards, I'm asking the player to mail me an encrypted file that contains this information. I'll decrypt and post the results.

    My question to those of you with experience in statistics is: What is the lowest number of rounds I should insist that the game be played before I accept their results? Obviously, the more the better, but there are only so many times you can watch a frog get splattered before the novelty wears off. How many times, minimum, do I have to have them play before we can fairly safely rule out a lucky streak, and determine that it's probably magic at work?

    I'm serious. Help me out, here, I've never had a statistics course, and I'm sure there's a lot of interesting mathematics behind analyzing this.
  3. Aug 28, 2003 #2



    Before we can perform a statistical analysis on this problem, we need to decide how to model it mathematically. This is pretty straightforward, though: since this is a random experiment where each trial has two possible outcomes (right or wrong), we can work with the binomial distribution. Let p denote the user's true probability of guessing correctly on any given trial (under pure chance, p = 0.5), and let p' denote the proportion of correct answers the user actually achieves over all trials. The next step is to formulate a test hypothesis:

    null hypothesis: p = 0.5 (the user performs no better than chance)
    alternative hypothesis: p > 0.5 (the user has some special ability)

    The idea is to accept or reject the null hypothesis based on the data you collect. For any such hypothesis test, there are two ways you can reach the wrong conclusion: either you reject the null hypothesis when it is actually true (a type I error), or you accept the null hypothesis when it is actually false (a type II error). Thus, ruling out a lucky streak amounts to avoiding a type I error, but you also want to guard against a type II error, namely concluding that the user has no special ability to perform above chance when he/she actually does. Let the probability of a type I error in your statistical conclusion be α and the probability of a type II error be β. The probability of a type I error (and thus the probability of declaring that the user has a special ability when he/she merely had a lucky streak) is controlled by fixing α at a desired value before data collection and statistical analysis are performed. There is a trade-off here, insofar as choosing an especially low value for α results in a relatively high value for β, so it would be inappropriate to make α arbitrarily small; a typically accepted value for α is 0.05.

    (From this point out I will mostly give equations with little or no explanation as to where they come from, since there is a lot of detail and derivation behind them.)

    Let n be the number of trials the user has performed. Once you have collected your data, you calculate the value for the test statistic:

    z = (np' - np) / sqrt( np*(1-p) )

    If z > z(α), the null hypothesis is rejected; in this case, z(α) = z(0.05) = 1.645, so reject the null hypothesis if z > 1.645, and otherwise accept it.

    However, what about β? A type I error is undesirable, but so is a type II error. Let power = 1 - β (in other words, the probability of rejecting the null hypothesis given that it is false). With a small sample size n, your hypothesis test will have a low power; in other words, there is a good chance that you will have accepted the null hypothesis when it is actually false. So we have to worry about the number of trials the user performs after all. Well, what value of n is acceptable? A typically accepted value for β is 0.1 (thus power = .9), with a corresponding z value z(β) = z(0.1) = 1.28. The formula for the value of n needed to give a power of 0.9 to the test described above is:

    n = [ ( 1.645*sqrt( p*(1-p) ) + 1.28*sqrt( p'*(1-p') ) ) / (p' - p) ]^2

    Notice that this formula includes p', the proportion of correct choices made by the user in the experiment. This leads to a problem: how can you tell the user to perform at least n trials of the test, where n is calculated as above, before you have his/her test results? A practical answer is to decide what kind of value for p' you want your test to be sensitive to. Do you want to make sure your test has a power of 0.9 if the user winds up guessing correctly 51% of the time? 55%? 60%? Making the test sensitive to p' = 0.51 with a power of 0.9 would require 21,386 trials, clearly a prohibitive amount not worth the trouble for your purposes. For p' = 0.55, n = 852; still a high number, but much more realistic. For p' = 0.6, n = 211; even more doable, but less sensitive to more subtle 'abilities' on the part of the user. For instance, if you asked the user to perform 211 trials but the user actually had the ability to guess correctly 55% of the time, the actual power of your test would only be about 0.42, not 0.9. There are all sorts of trade-offs involved, and ultimately the specific design is up to you; just remember to be careful about both type I and type II errors.
  4. Aug 29, 2003 #3
    Thanks, hypnagogue! That was a very interesting and helpful analysis!