# Generating Probability Resulting in a Normal Distribution

1. Aug 28, 2012

### BubbaXXL

I'm programming a process that runs in iterations with a probability of stopping at every iteration. As it gets closer to the desired run-time, this probability should increase so that if the process is run many times over, the runtimes will form a normal distribution, with most of them having the desired/mean duration (number of iterations for which it ran) and very few of them either stopping at start or running for twice as long (at the close-to-zero edges of the bell-curve).

How can I get the probability for stopping at every iteration as the duration increases and the process iterates along the curve?

I think this might be a more mathematical interpretation of my problem:

I tried finding the probability that the process does NOT stop at a certain iteration by taking the cumulative distribution function for the duration and then dividing it by the probability of the previous iterations in order to get the probability for that one event, but the values I got were skewed to the left, as in most cases having a duration under 20 (with 20 being my desired mean)

Last edited: Aug 28, 2012
2. Aug 28, 2012

### Stephen Tashi

Do you mean "natural"? or "normal"? A "normal" distribution implies a specific family of distributions. Measurements of a positive quantity (such as elapsed time) cannot really have a normal distribution because they are bounded below by zero. They might have an approximately normal distribution.

3. Aug 28, 2012

### BubbaXXL

Well, all I really want is for the distribution of the different runtimes to have a bell-curve shape like in the picture on the right here: http://en.wikipedia.org/wiki/Normal_distribution

Whichever implementation is the simplest and gives me the most control.

my CDF function currently just looks up the values from a CDF table for the range -3.09 to 3.09 so it isn't algebraically precise anyway. I then just divide the values in order to fit them to the 3.09 scale, like this:

prob = (1 - cdf((duration - 20)/20 * 3.09)) / previousProb

Oh yes, I guess what I meant is normal, yes.

Last edited: Aug 28, 2012
4. Aug 28, 2012

### Stephen Tashi

On the internet, one time-honored way of answering the question "How do I do this" is to say "You don't want to do that". I'm curious why in the world you would want to a normal distribution of run times, but, for now, ill accept your goal.

As I understand your question it is this:

Let P(S_i ) be the conditional probability that the process stops at the end of step i given that it has reached step i. Let X be the random variable whose value is the number of steps that occur before the process stops. You want a formula for assigning the values of P(S_i), i = 1,2,... so the distribution of X will be approximately normal. You also seem to have specific mean and standard deviation that you want to attain.

If that's what you're asking, my guess is that there is an approximate answer and someone on the forum can probably figure it out. I'm not sure I can.

5. Aug 28, 2012

### BubbaXXL

Thing is, I think I already know a different way to achieve the desired results by calculating the runtime for every process before it starts.

I would just have the computer generate random values in 0.0-1.0 and apply the Box-Muller to them, or some better transform, giving me a normally-distributed random value with which I could modify the desired runtime for every process to be either shorter or longer.

I would prefer having it achieve the desired distribution by checking a probability on every iteration though.

6. Aug 28, 2012

### BubbaXXL

Yes, I think that is correct if I understand it right.

As for the mean and the standard deviation, I adjusted the standard normal distribution CDF values to fit my normal distribution using (duration - 20)/20 * 3.09), perhaps that messed it up?

I'm trying to code a sub-routine for a simple intelligent-agent (http://en.wikipedia.org/wiki/Intelligent_agent) that is supposed to simulate natural behavior. having the runtime be constant or vary by plain randomness would make it robotic.

Last edited: Aug 28, 2012
7. Aug 28, 2012

### Stephen Tashi

I can see why you don't want the runtime constant. I don't know what you mean by "plain randomness". If you made the probability of stopping at each step some small constant probability, you'd get an approximately exponential distribution which is a distribution commonly used to model the length of time to complete a task. Is that what you mean by "plain randomness"?

---

To put this in terms of know mathematics, we are trying to find a "hazard" function. This comes from survival analysis: http://en.wikipedia.org/wiki/Survival_analysis. To me, the subject is unpleasant to think about because the terminology is all backwards from the point of view of normal probability theory. It focuses on functions giving the probability of t > T instead of the usual cumulative distributions that give the probability of t < T.

Last edited: Aug 28, 2012
8. Aug 29, 2012

### BubbaXXL

The most simple form of randomness I have is the built-in random function which gives me a pseudo-random number in a specified range. I was assuming that a constant probability would give me a uniform distribution where the process will stop at the first iterations equally often as it will run for a mean duration.

I guess if one looks at the probability of the process not stopping up until a certain step, that probability will decrease as it gets multiplied by every previous probability. It's complement probability would then be the stopping probability which would increase.

Would the distribution curve look identical to the probability curve?

EDIT: I guess "plain randomness" would be the kind where I pre-determine the runtime using a single random value at the beginning of the process. That would give a uniform distribution.

Last edited: Aug 29, 2012
9. Aug 29, 2012

### Stephen Tashi

What do you mean by a "probability curve"?
Are you asking the cumulative distribution versus the probability density? They would not be identical.

If you have a constant probability of stopping on each step then the random variable that gives the number of steps until you stop is a "geometric random variable". It's distribution is a geometric distribution: http://en.wikipedia.org/wiki/Geometric_distribution

10. Aug 29, 2012

### BubbaXXL

if the first curve would represent the probability of stopping on every step, and the second would represent the number of trials that stopped on every step.

11. Aug 30, 2012

### Stephen Tashi

I assume that you meant the first curve is g(n) which gives the probability of stopping on step n given that the process has reached step and that the second curve s(n) gives the probability of stopping exactly on step n. These two functions are, in general, different.

If you want a mean of $\mu$ and a standard deviation of $\sigma$, I think the probability of stopping on step n, given you have reached step n without stopping should be computed as follows:

Let $f(x)$ be the probability density of the normal distribution, so:
Let $$f(x) =\frac{1}{\sigma \sqrt{2 \pi}} e^{ -\frac{1}{2}( \frac{(x - \mu)}{\sigma})^2}$$

Let $F(a,b) = \int_{a}^{b} f(x) dx$

Set the probability of stopping on step n to be:

$$g(n) = \frac {F(n-1,n)}{F(n-1,\infty)}$$

12. Aug 30, 2012

### BubbaXXL

I've only formally studied probability math to the point of combinations and permutations, so some of the statistics concepts and terms get confusing to me. Perhaps I'm asking an obvious question. If I were to have a constant probability at every step, giving me a geometric distribution, will the two following discrete functions be identical in shape:

"g(n) which gives the probability of stopping on step n given that the process has reached step"

r(n) the number of runs of the process that actually stop on step n. (if the process were to be run a very large number of times). If I ran the process 5 times and it would stop on step 2 three times, r(2) = 3

If the mean of the geometric distribution is where most of the processes should stop, they cannot be identical though.

(As far as I understand the CDF, it also includes previous steps so, if the same example stopped once on step 1, cdf(2) = 4)

13. Aug 30, 2012

### Stephen Tashi

The function g(n) would be the constant function g(n) = p.

The probability of stopping exactly on step n is $(1-p)^{n-1} p$.

You can't be sure of any specific number of outcomes. If you ran the process a large number of times $M$, you can talk about the average number of times that it would stop on step 2. I think that average number is $M (1-p)^1 p$.

The most probable outcome of a random variable is called its "mode". It's "mean" corresponds to what laymen call its "average". You can see from the graphs of the geometric distribution in the Wikipedia article that the mode of the geometric distribution is the event of stopping on the first step. You are correct that the average value of the geometric distribution is not its mode.

If you ran a large number of trials the fraction of them that stopped exactly on step 2 has a high probability of being close to (1-p)p

The value of a cdf is a number in the interval [0,1], so it can't be 4.

The cdf F(k) of the geometric distribution evaluated at 2 would computed by:

F(2) = probability of stopping exactly step 1 + probability of stopping exactly at step 2
= p + (1-p)p

(As the Wikipedia article notes, there is some inconsistency in the way "the" geometric distribution is defined in textbooks. Note what it says about the "shifted" geometric distribution. I'm using F(2) to mean the probability of stopping on or before step 2. )