Converting Binomial cpk to a fair dice deviation.

andrewr · Feb 23, 2012

Hi,

I would like to detect low-probability outcomes in a "fair dice" problem, for use in graphing outliers.

The actual data is binned into cells of 1% probability, from which a histogram of the dice faces can be drawn. (The original data distribution is not equal probability; however, the histogram cell distribution *is*.)

I have made the problem equivalent to a histogram of a 100- or 200-faced dice.
E.g., there are 100 cells for binning positive-only data, or 200 cells for binning signed data.

Now dice are merely a Bernoulli / binomial trial, and I have only two kinds:

100 cell version: n=#diceRolls p=0.010, q=0.990
200 cell version: n=#diceRolls p=0.005, q=0.995

For the 200 cell version, I compute:
μ = np = n(0.005)
σ**2 = n(0.005)(0.995)

But, I am uncertain of a few facts:
1) Does the standard deviation (σ) have a different meaning here than for the normal distribution? E.g., is it no longer true that ~68% of the data lies within 1 sigma for a 100- or 200-sided dice?
2) If the meaning is different, is there a simple formula to convert a { p, q = 50%,50% } binomial's precomputed cdf value into a { p, q = 1%, 99% } cdf probability?

For example: I'd like to compute 'a' for a single cell such that less than 5% of the time would a fair dice be above μ + 'a' samples out of 'n' dice rolls.
I'd also like to compute 'a' boundaries for 1%,0.1%,and 0.01%.

I am able to compute the 'a' value, exactly, even for large n when doing a 2 cell p,q=50%,50% cumulative binomial distribution.

So, I am just wondering if there is an easy way to convert the result of that computation to a 100 or 200 sided dice { p,q=0.01,0.99 } binomial? (eg: that won't cause computer overflow errors.)
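One way to sidestep the overflow concern is to work in log space and compute the p = 0.005 (or 0.01) binomial directly, rather than converting from the 50%/50% case. The following is only a minimal sketch under that assumption; the function names `binom_log_pmf`, `upper_tail` and `count_threshold` are made up for illustration, and the brute-force loop is meant for small to moderate n.

```python
import math

def binom_log_pmf(n, k, p):
    """Natural log of P(X = k) for X ~ Binomial(n, p), using log-gamma
    instead of factorials so nothing overflows."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))

def upper_tail(n, k, p):
    """P(X >= k), summed term by term (simple, not fast)."""
    if k <= 0:
        return 1.0
    return math.fsum(math.exp(binom_log_pmf(n, j, p)) for j in range(k, n + 1))

def count_threshold(n, p, alpha):
    """Smallest count c such that P(X > c) < alpha."""
    for c in range(n + 1):
        if upper_tail(n, c + 1, p) < alpha:
            return c
    return n

# Example: one cell of the 200 cell version after n = 200 rolls.
n, p = 200, 0.005
for alpha in (0.05, 0.01, 0.001, 0.0001):
    c = count_threshold(n, p, alpha)
    print(alpha, c, c - n * p)  # alpha, count limit, and 'a' = limit - mu
```

Here 'a' is taken to be the count limit minus μ = np, which is one reading of the question above.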

Thanks.

chiro · Feb 23, 2012

Hey andrewr.

For 1) the variance is a global measure for all distributions that have a finite variance (some distributions have moments which can't be calculated just in case you're wondering).

What happens in terms of the binomial and the normal is that if np is a 'large' number (some say greater than 10) then a normal approximation can be used where the mean is np and the variance is np(1-p) which is a good thing if you have a lot of cells which is what you have.

So yeah for a normal approximation the mean and the variance are the same for the normal distribution but your original distribution is a binomial distribution and it's important to be aware of 'why' normal distributions are used with binomial parameters (i.e. for approximation purposes). The reason boils down to it being 'easier' to calculate probabilities using normal than binomial especially when we have a lot of cells to calculate.

For 2) Again it relates to using a normal approximation. You basically do the same process for calculating quantiles of a normal or whatever other probability info you have but your normal is a N(np,np(1-p)) distribution.

Your situation is common and it's one of the reasons we use normal approximations for binomial distributed data.
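As a rough illustration of the normal approximation described above, the quantile of N(np, np(1-p)) can be read off with Python's standard library. This is a sketch only; `normal_count_threshold` is a hypothetical helper name, and the result is only trustworthy when np is reasonably large.

```python
import math
from statistics import NormalDist

def normal_count_threshold(n, p, alpha):
    """Count level exceeded with probability alpha under the
    N(np, np(1-p)) approximation to Binomial(n, p)."""
    mu = n * p
    sigma = math.sqrt(n * p * (1 - p))
    return NormalDist(mu, sigma).inv_cdf(1 - alpha)

# Example: 200 cell version with enough rolls that np = 500.
n, p = 100_000, 0.005
for alpha in (0.05, 0.01, 0.001, 0.0001):
    limit = normal_count_threshold(n, p, alpha)
    print(alpha, limit, limit - n * p)  # alpha, count limit, 'a' above mu
```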

andrewr · Feb 23, 2012

chiro said:

Hey andrewr.

For 1) the variance is a global measure for all distributions that have a finite variance (some distributions have moments which can't be calculated just in case you're wondering).

OK, fine so far. I imagine there are quite a few unusual mathematical formulas that defy moment(ation)... !

What happens in terms of the binomial and the normal is that if np is a 'large' number (some say greater than 10) then a normal approximation can be used where the mean is np and the variance is np(1-p) which is a good thing if you have a lot of cells which is what you have.

Unfortunately, the number of rolls n can drop as low as 2.
Even at n=25 rolls, there is only a 25/100 = 25% chance that any given cell is occupied by even a *single* count, therefore most of the histogram seems to fail your test until about 10 = n(0.01) and ergo n=1000 rolls.
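The 25/100 figure is really the expected count per cell; the chance that a cell holds at least one count is 1 - (1-p)**n, which comes out a little lower. A small sketch to check the arithmetic (the function name `prob_cell_occupied` is just illustrative):

```python
import math

def prob_cell_occupied(n, p):
    """P(at least one of n rolls lands in a cell of probability p),
    i.e. 1 - (1 - p)**n, computed stably for small p."""
    return -math.expm1(n * math.log1p(-p))

print(prob_cell_occupied(25, 0.01))    # roughly 0.22 for the 100 cell version
print(prob_cell_occupied(1000, 0.01))  # close to 1 once np = 10
```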

If the distribution were large, I could use the Poisson distribution according to my 10 year old engineering stats text; but I don't think it is large (am I mistaken?)
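For what it's worth, the usual condition for the Poisson approximation is a small p with λ = np, not a large count per cell, so it may be worth trying here. The sketch below compares the two tails for a small case; both function names are hypothetical, and the Poisson routine as written is only meant for modest λ (exp(-λ) underflows for very large λ).

```python
import math

def poisson_upper_tail(lam, k):
    """P(Y >= k) for Y ~ Poisson(lam), as 1 minus the sum of the
    first k terms. Intended for small lam = n*p."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)      # P(Y = 0)
    cdf = 0.0
    for j in range(k):
        cdf += term            # add P(Y = j)
        term *= lam / (j + 1)  # step to P(Y = j + 1)
    return 1.0 - cdf

def binomial_upper_tail(n, k, p):
    """Exact P(X >= k) for small n, for comparison only."""
    return sum(math.comb(n, j) * p**j * (1 - p)**(n - j)
               for j in range(max(k, 0), n + 1))

n, p = 200, 0.005
for k in range(1, 7):
    print(k, binomial_upper_tail(n, k, p), poisson_upper_tail(n * p, k))
```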

On the other hand, there is the "chi" (chi-squared) test, which is an *inexact* test and which is not easily computable in a non-stats (non-MATLAB/Mathematica) environment (which I no longer own as I'm not a student any more...), and the alternative hypergeometric *exact* tests are impractical because of the computation issues. (I'm trying to keep this part simple...)

So yeah for a normal approximation the mean and the variance are the same for the normal distribution but your original distribution is a binomial distribution and it's important to be aware of 'why' normal distributions are used with binomial parameters (i.e. for approximation purposes). The reason boils down to it being 'easier' to calculate probabilities using normal than binomial especially when we have a lot of cells to calculate.

OK, I do understand that --and I looked it up again to be sure -- although, I am a bit puzzled, eg: I have a program that calculates the binomial probabilities *exactly* as a decimal fraction -- so I don't know if resorting to the bell curve / normal continuous distribution is necessary. Even for n of 1e6, I can still use it as it avoids the factorial or Stirling approximations. But it only computes the cdf for a p,q=50%,50%.

By analogy, you seem to be suggesting I use erf(), which (suitably scaled) gives the cdf for the normal curve and is the limit for the cdf of a binomial as n->inf, with the standard deviation used to scale the x axis.

Is there some easy formula to convert the erf() cdf into one where the probabilities p,q are not 50%/50% ? Perhaps I could use that to convert my exact binomial probabilities...

I'd like to post a couple of graphs which give more detail than I can do in a 1000 words...
eg: to show that even for large n, there is a very detectable difference in the variance for 'a' above 'mu' and 'a' below mu; So a symmetrical bell curve seems wrong for this problem.

I'm not sure, though, how to upload pictures onto the physics forum where everyone here is guaranteed to be able to see it; could someone explain how to do that in detail?
I have portable pix maps .ppm files, and I might be able to convert them to .jpg.

For 2) Again it relates to using a normal approximation. You basically do the same process for calculating quantiles of a normal or whatever other probability info you have but your normal is a N(np,np(1-p)) distribution.

Your situation is common and it's one of the reasons we use normal approximations for binomial distributed data.

Exactly, which is why I am surprised at how the common answers don't seem to work well!

Intuitively, the p,q=0.01,0.99 binomial distribution has a distinct bias since one can "under" roll a slot only by a fixed amount which is far less than one can "over" roll a slot. Hence, the improbability of 'a' is not even symmetrical in a single cell.
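The asymmetry can be put into a number with the standard skewness formula for a binomial, (1 - 2p) / sqrt(np(1-p)), which is zero at p = 0.5 and shrinks only like 1/sqrt(n) otherwise. A small sketch (the function name is illustrative):

```python
import math

def binomial_skewness(n, p):
    """Skewness of Binomial(n, p): (1 - 2p) / sqrt(n p (1 - p))."""
    return (1 - 2 * p) / math.sqrt(n * p * (1 - p))

for n in (25, 1000, 100_000, 10_000_000):
    # p = 0.5 is symmetric; p = 0.01 stays skewed until n is very large.
    print(n, binomial_skewness(n, 0.5), binomial_skewness(n, 0.01))
```

A count can also fall at most np below the mean but up to n - np above it, which is the fixed "under" roll limit mentioned above.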

By analogy, in a stock market, one may buy or sell, short or long. The difference in potential losses is very drastic. In one case, the losses have a maximum -- in the other -- people often believe there is no limit. (Though, strictly speaking this can be debated...)

I appreciate the help so far chiro.
--Andrew.

chiro · Feb 23, 2012

As you've pointed out, there are limitations in how accurate these kinds of approximations can be.

When you have a system that is as 'skewed' as yours (probabilities close to zero and 1 for example), then you know you will either need a 'huge' number of trials or you will need to use a non-normal approximation or just a computer program to calculate the probability for you.

So yeah I agree this kind of thing is a real pain in the neck when you have a distribution with 100-200 cells each having multiple factorial terms.

Post your pictures as attachments and we'll do our best to help you.

andrewr · Feb 24, 2012

Ok, I uploaded the picture files as attachments, but it only allows 3 and I really needed 4. Are you able to see them? and do you know if there is any way to put them in-line in my text, so I have to write less description?

The first picture is the plot of a 5 million sample Gaussian distribution converted into dice roll quantiles (1% deviation cells); the second is a histogram of the first plot's cell histogram (eg: a gaussian check of the dice roll distribution). In theory, if your idea of a large number of points were truly estimated by a normal curve -- the data in that plot would be "normal" looking... ?

I haven't tried folding the plot to make the deviation sign-less as that might be a bit closer to "normal". Perhaps tomorrow I'll try that out; but the problem still exists for signed data.

The quantiles themselves (1st plot & last plot) look visually to be uniformly distributed -- eg: as would be expected for a normal distribution; but the visual picture is mildly misleading.

If I plot a 1000 data point Gaussian, its deviation from the bell curve is far more severe than the 5 million data point example. (3rd plot). But the bell shape can still be detected and the discrete granularity of the dice rolls becomes very visible.

Reminder: What I want to add to my plot (e.g., the first and last one) is a colorization or enlargement of data points on the histogram to indicate outliers. E.g., one color for quantiles which are 5% chance/plot probable and another for 2% chance/plot; for 0.1, 0.01, 0.001, which are severe, I was thinking to change the size of the data points to make them REALLY obvious.

Any thoughts on how to attack this asymmetrical devil?

Edit: the second graph is changed as it was curve shape regaussed, instead of cell percentile -- my apology for any confusion this caused.

andrewr · Mar 8, 2012

Estimating outliers by pseudo multiplying data-points using Gaussian EXTRAPOLATION

I studied the problem a bit more this week; I found something I hadn't noticed before.
On the graphs, the median value for the Gaussian distribution of "fair dice" cells wasn't 1.0 (100%), which is, by definition, wrong.

When I looked at my program more closely, I realized that I was dividing the data up into 101 or 201 cells, rather than 100 and 200. By counting twice with this error, I arrived at a much more skewed (amplified) distortion than the actual case...

See below for new graphs of the same random generator data-set, but counted correctly.

I also added a smoothing algorithm for the curve shape to make it easier to see trends.
1) Do you think the following idea is actually doing what I think it is doing?

What I did is take a running average of 3 neighboring cells and compute the center cell's |deviation| from that average. I then distribute a fixed percentage (50%) of that "error" deviation to the cells on the right and the left, BUT in proportion to e**(-0.125*x**2); e.g., it's the diffusion equation for heat, etc., where the difference in cell centers determines how much probability diffuses to the right or left cell. I am hoping that this will effectively keep the same shape of the distribution, but as if many times the data points were sampled into 100 cells, such that a normal approximation becomes valid... (chiro is correct about that...)
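To make the smoothing step concrete, here is one possible reading of it as code. This is only a guess at the procedure: the signed deviation is used rather than |deviation|, the left/right split defaults to equal halves, and the e**(-0.125*x**2) weighting is left to a caller-supplied `weights` function because the post does not pin down what x is. The names `smooth_once` and `weights` are hypothetical.

```python
def smooth_once(counts, fraction=0.5, weights=None):
    """One diffusion-style pass over a histogram.

    For each interior cell, compare it with the mean of itself and its two
    neighbors, then move `fraction` of the difference into the neighbors.
    `weights(i)` may return (w_left, w_right) summing to 1; the default is
    an equal split. All moves are computed from the original counts.
    """
    out = [float(c) for c in counts]
    for i in range(1, len(counts) - 1):
        avg = (counts[i - 1] + counts[i] + counts[i + 1]) / 3.0
        moved = fraction * (counts[i] - avg)  # positive if cell i is a peak
        w_left, w_right = weights(i) if weights else (0.5, 0.5)
        out[i] -= moved
        out[i - 1] += moved * w_left
        out[i + 1] += moved * w_right
    return out

print(smooth_once([1, 1, 4, 1, 1]))
```

Because every amount removed from one cell is added to its neighbors, the total count is unchanged by the pass.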

2) I have a slow Bernoulli trial computation used for the discrete non extrapolated distribution; how do I make it fast?
Since I don't know if 1) is accurate, I compute bp(p,q,n,r) in my program (now) with p=0.01,q=0.99 , and that gives me the probability of outliers correctly (Not shown in plots, yet);
I still find that the deviation of dice rolls from the "average" is still asymmetrical as noted before, but it does get more "normal distribution" looking as the number of data-points counted becomes >> 1000.
(Thanks chiro! I just wish I had spotted the 101 vs. 100 cell problem earlier! ;) )

Does anyone know of a fast computing approximation to a binomial, that is more accurate than taking e**(-0.5*z**2), and fitting z to the binomial's mean np and deviation (npq)**0.5? (Standard textbook approximation of the binomial by the normal curve)
eg: I need an algorithm good for all n, from 2 to 1e7 data points... !
eg: something which is more accurate near the tails? (I don't care about the "center" as outliers are never near the center, obviously...)
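One option that stays exact in the tail is to start the sum at the outlier count itself and step upward with the ratio P(j+1)/P(j) = (n-j)/(j+1) * p/q, stopping once the terms no longer matter. Above the mean the terms shrink quickly, so only a short run of them is needed even for n = 1e7. The sketch below assumes that approach; `fast_upper_tail` and `rel_eps` are made-up names, and it deliberately refuses counts at or below the mean.

```python
import math

def fast_upper_tail(n, k, p, rel_eps=1e-17):
    """P(X >= k) for X ~ Binomial(n, p), for integer k above the mean n*p.

    The first term comes from log-gamma (no factorial overflow); later
    terms come from the pmf ratio, stopping when they become negligible.
    """
    if k > n:
        return 0.0
    if k <= n * p:
        raise ValueError("this sketch only handles k above the mean n*p")
    log_term = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                + k * math.log(p) + (n - k) * math.log1p(-p))
    term = math.exp(log_term)   # P(X = k); may underflow to 0.0 far out
    ratio = p / (1.0 - p)
    total = 0.0
    j = k
    while True:
        total += term
        if j == n:
            break
        term *= (n - j) / (j + 1) * ratio  # P(X = j + 1) from P(X = j)
        j += 1
        if term <= total * rel_eps:
            break
    return total

print(fast_upper_tail(200, 4, 0.005))
print(fast_upper_tail(10**7, 51_000, 0.005))
```

For the lower side, the same routine can be reused by counting misses instead of hits: P(X <= k) for Binomial(n, p) equals P(Y >= n - k) for Y ~ Binomial(n, 1 - p).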

3) Inspecting the last two graphs, I know they are going to be skewed; and I was thinking I could possibly replicate the idea of binning of a continuous bell/Normal curve into 1% quantiles; but do it with the Bernoulli trial p=0.01,q=0.99 made into a pdf with n→∞; eg: scaled by keeping the means and asymmetric deviations preserved to scale n vs the x-axis (deviation). I could then make cells to bin the data into the percentile chance of the particular dice roll, vs a normalized 100% Bernoulli trial; and then make smaller cells in the outlier regions to find 0.5% 0.1% 0.05% 0.01% probabilities.

Does anyone know a continuous PDF for the Bernoulli trial?
