# Expectation value for CDF fails

1. Jun 4, 2012

### andrewr

Hi,

(Sorry for the slight misnomer in the title... I can't edit it!)

I'm doing several problems to compute the expectation value and variances of sub-samples & operations on the normal distribution; and I am having trouble getting results that agree with numerical simulations.

I have several Gaussian random generators written in C, and am using vectors of length 1million to 10billion to do numerical experiments; all generators appear to converge accurately to at least 4 decimal places with the sample sizes I am using. For example in μ(σ) notation, adding 0(1) + 0(1) gives 0(1.414), exactly as expected.

The random generator is crude, but replacement of it doesn't change the result.
Program is attached...

So, for my first (trivial) problem (and hopefully the same mistake is in my other questions!):
When I do Ʃ |0(1)|/n, I get 0.8933.

However, if I try to solve the latter problem analytically, I get a different result.
Could someone point out where I have made a mistake, please?? :)

I think the pdf for a Gaussian is:
$$p(x)= \sqrt{1 \over 2\pi}e^{-{1\over2}x^2}$$

Since abs() only uses Half the Gaussian curve, the pdf ( w/ domain [0,∞) ) becomes:
$$p(x)= 2\sqrt{1 \over 2\pi}e^{-{1\over2}x^2}$$

To compute the expectation value, all I ought to do is solve:
$$\overline{X}=\int\limits_0^{∞}x \times p(x)$$

$$\overline{X}=2\sqrt{1\over2\pi}(\left.-e^{-0.5x^2}\right|_0^{∞})=2\sqrt{1\over2\pi}≈0.7979$$

Which is, of course, wrong....
So, what did I do that messed the result up?

#### Attached Files:

• ###### compute.txt
File size:
3.2 KB
Views:
81
Last edited: Jun 5, 2012
2. Jun 5, 2012

### haruspex

I set up a spreadsheet using RAND() and the Box-Muller method to generate sets of 600 random values with N(0,1) distribution. The averages of the absolute values hovered around .75 to .85.

3. Jun 5, 2012

### andrewr

Hmm...
I just wrote a small test program in Python to double check, as it uses a ratio of uniform [0,1) randoms to generate a Gaussian -- and it does indeed come up with 0.7978 for 10Million vectors.
So it does have to be my program.

Thanks.

Edit:
Ohhhh! I just found it, I was still taking the root of the answer and computing the standard deviation -- not the mean/expectation....
Sigh... Thanks again, and very much!

--Andrew.

Last edited: Jun 5, 2012
4. Jun 10, 2012

### andrewr

Unfortunately, my other mistakes are not in fact related...

Side note: For those downloading the program out of curiosity, there are bugs in it. (The rejection method needs to be multiplied by 1/sqrt(2 pi). etc.

I have found & written much better Gaussian random number generators; and I'd rather share good code, than leave broken code around in the near future: I am attempting 1 trillion computation calculation experiments, now, and it is becoming more important to SEIRIOUSLY optimize the generator for speed AND accuracy. I'm sure some of you would like to do the same -- but easily... I'd like to help.

To my surprise: The Box muller method really isn't that fast! although it is reasonably accurate. Computing the logarithm function appears to be the slowest possible math function for doing Gaussian generation -- and (to my surprise) almost all *accurate* random number generators use it on every computation.

I have come up with a much better method (and of course, found out someone else discovered it before me...) but I am re-writing it to avoid the logarithm function as much as possible. This gives a 10x speed improvement on my machine -- so it IS worth chasing after....

On to my mistake ... This one has to be analytical:
I am attempting to compute the standard deviation of two N(0,1) distributions added together -- and the answer is sqrt(2). I'm not sure where I lost the constant, but I get sqrt(4) -- and I am thinking, maybe when I did a change of variables -- I forgot something. I just looked it up, and I am tired enough -- maybe I am just blind to what step I mis-copied.

The probability of each data element is:
$p(x)={1 \over k}(e^{-{1\over2}x^2}e^{-{1\over2}y^2})$
$k=\int \limits_{- \infty }^{ \infty } \int \limits_{ - \infty }^{ \infty } e^{-{1 \over 2}(x^2+y^2)}dxdy=\pi$

$\sigma^2(x,y)=(x+y)^2$

Which makes the weighted variance:
$p(x,y) \sigma^2(x,y) = {1 \over \pi}(x+y)^2e^{-{1 \over 2}x^2}e^{-{1 \over 2}y^2}={1 \over \pi }(x+y)^2e^{-{1 \over 2}(x^2+y^2)}$

So the combined equation yields:
$\overline{ \sigma ^2 }= \int \limits_{- \infty }^{ \infty } \int \limits_{ - \infty }^{ \infty }{1 \over \pi }(x+y)^2e^{-{1 \over 2}(x^2+y^2)}dxdy$

And I think this is where I might have messed up? I needed a change of coordinate systems/variables...
$\overline{ \sigma ^2 }= \int \limits_{ 0 }^{ 2 \pi } \int \limits_{ 0 }^{ \infty }{1 \over \pi }(r^2+2r \times cos( \theta ) \times r \times sin( \theta ) )e^{-{1 \over 2}r^2}r \times dr d \theta$
$\overline{ \sigma ^2 }= {1 \over \pi }\int \limits_{ 0 }^{ \infty }\int \limits_{ 0 }^{ 2 \pi } (1+ 2cos( \theta ) sin( \theta ) )r^3e^{-{1 \over 2}r^2} \times d \theta dr$
$\overline{ \sigma ^2 }= {1 \over \pi }\int \limits_{ 0 }^{ \infty } \left. ( \theta - {cos^2( \theta )} ) \right|_{ 0 }^{ 2 \pi } r^3e^{-{1 \over 2}r^2} \times dr$
$\overline{ \sigma ^2 }= {1 \over \pi }\int \limits_{ 0 }^{ \infty } 2 \pi \times r^3e^{-{1 \over 2}r^2} \times dr$
$\overline { \sigma ^2 }= 2 \left. ( -e^{-{1 \over 2} r^2} (r^2+2)) \right|_{ 0 }^{ \infty }$
$\overline { \sigma ^2 }= 4$

Sigh, It must be simple -- but I don't see it....

Last edited: Jun 10, 2012
5. Jun 10, 2012

### haruspex

$k=\int \limits_{- \infty }^{ \infty } \int \limits_{ - \infty }^{ \infty } e^{-{1 \over 2}(x^2+y^2)}dxdy=2\pi$
Wrt methods for generating N(0,1) values from the uniform distribution of RAND, I've read that simply averaging batches of a dozen or so gives good enough results for most purposes, and much faster. Haven't tested that.

6. Jun 10, 2012

### andrewr

I copied that result directly from a table.... Figures, I didn't notice that it was an integral of exp( x^2 + y^2 ) and not 1/2 of that....
Thanks for releasing me from mathematical hell...

Regarding the generation of N(0,1);
I have experimented with that, and I'll point out some issues -- and future directions I plan on testing...

When I benchmark general operations inserted in measurable places within the program I submitted, and including any penalities from OS exceptions, etc. I come up with the following approximate times for 1 billion iterations.

*, or + of doubles = indistinguishable from 1 or 2 seconds.
exp() --> 46 seconds
sin(),cos() --> 15 seconds.
sqrt() --> 36 seconds.
Rand01() --> 60 seconds.
log() --> 1 minute 27 seconds.

Generation of decent random numbers isn't a really fast operation; I improved the generator I posted before to use many less operations, to fit in a CPU cache (well), remove IO and Unix dependencies... and it still wasn't fast...

To average 12 of them would be far slower than using the log() function....
I plan on exploring the Merzenine (sp?) twister core generator to see if it is much faster, but I doubt it.

I tried a three sum RAND() generator, and it produced significant errors; and was already slow.

There are a handful of Gaussian generation techniques that I find in the literature...
Trimming by rejection of either a pure Rand function, or a distorted version of it;
Addition of normalized (standard deviation scaled) RAND numbers -- because sums of Gaussians are also Gaussians...
Piecewise linear selection of random or "binned" values (Ziggaraut method), and several other hybrids...
And transformation based generators.

wrt: summing RAND()'s, I think it the addition of normalized RAND's might be fast if applied to say, a triangle distribution, by summing two RAND() values. That is something I will probably try.

The transformation class of generators is the one I believe can be made be the fastest. I haven't figured out how to code the Ziggaraut, but I would like to try it -- for that is the only serious contender that I can see by estimation of the operations used.

The main problem is that CPU Cache times are non-intuitive on the newer processors, and it isn't trivial to find out what is causing slowdowns on memory access bound routines. The zigaraut method relies on tables in memory, just as my random generator does. This is typically a blazingly fast method of computation -- but to my surprise, it really isn't in the routines I bench-marked. So, although some sites claim the Ziggaraut to be the fastest routine -- I am wondering if that is a hand optimized version, or a generic version. eg: I am not wanting to produce CPU specific code, so I would rather test code that is only written at the C level.

I found a generator very close to the idea I have for a transformation -- it is called (of all things, as I like Python...) the Monty Python method...

But the simplified description makes it clear that it too uses either the log function or the exp() function for a very large percentage of the operations. (No actual code was available in the descriptions I read.)

In my idea, I can reduce that percentage massively -- so that only 5% or less of the samples will use the exp() function -- (and perhaps twice...); so that I hope to target the add/multiply at two seconds with 6% of the exp time as a penalty (around 3-4 seconds.) I figure that's the best I am capable of coding with what I know...

We'll find out soon!

7. Jun 15, 2012

### andrewr

Let the real fun begin -- I got stuck, (of course), in the tail region of the Gaussian random generator I am making. It is going to be quite fast and accurate regardless of whether I fix a small error in the tail region, but I have a question about how to correct it in a certain way...

I'm attempting to make this post readable by breaking it up into chapters. I also attached a graph to this post which I will refer to constantly -- and unless you're blind, it will be extremely helpful explaining obscure sentences... (I did my best to be clear, even if you are blind.... ;) )

The question itself can be found by looking at the chapter "-------" headings.

-------------------------------------------------------------
General/generic background

All generators of random distributions produce random numbers in proportion to the differential area under the curve of choice. So, in principle, one can locate a point "inside" the distribution by choosing two uniform random numbers -- one for the y axis and one for the x axis, and then seeing if the y value is "under" the curve. (AKA the x,y pair is found in the area under the curve.)

This method forces every point in the randomized uniform rectangle to have equal likelihood -- and since the number of points ( any y < curve's_y(x) ) making up the vertical line under the curve (at each x) is equal to the probability of choosing that x value by random chance, that's how often one should accept an x value that was uniformly chosen at random.

Therefore, all x,y based generators accept an x value *iff* a random y value is "under" the curve's y(x) value.

The easiest generation of the probability distribution, is to randomly select an x,y pair and throw away (reject) values until an x,y pair is found *inside* the area of the distribution curve. (AKA found Under the curve.) This final value is the only one returned.

This method has one unfortunate consequence when doing Gaussian generation: The number of rejects is *very* large.

Since the reject area outside a Gaussian grows rapidly as x values go beyond the 3 sigma mark (eg: the y of the curve becomes tiny) -- that means all practical Gaussian rejection generators must truncate the maximum x value they will randomly choose (I used 8 sigma as the limit in my last program...). This of course introduces error, but keeps speed of the program from being infinitely slow in picking a number...

The Monty Python method of generating the Gaussian distribution uses the x,y technique -- but attempts drastically reduce the rejection area outside the Gaussian distribution (eg: any random_y(x) > Gaussian_y(x) points).

The way the M.P,M does this, is to choose a fixed randomizer x,y rectangle which doesn't enclose the entire Gaussian curve's y values; Rather the rectangle excludes regions where acceptable y values exist, and hence can never be picked. BUT those excluded areas are re-mapped into areas inside the randomizer rectangle which are rejected otherwise. Hence reject points for one part of the Gaussian become acceptable points in another part -- only re-mapping of the x value is required.

Even then, however, compromises are required for computation speed since not all areas can be mapped into the box efficiently by simple translations and rotations. These compromises inevitably leave some reject areas inside the randomizer rectangle.

But, *if* the area of the randomizer rectangle is made exactly equal to the total area enclosed under the Gaussian -- then any reject area *inside* the x,y rectangle must exactly equal all areas not packed inside that rectangle and generally speaking, this happens to be the "tail" area previously mentioned...

----------------------------------------------------------
How I chose to partition the Gaussian for a M.P.M approach.

In the graph, there is a blue rectangle which has the same total area as the Gaussian curve it maps; call this the areaBox; now notice at all small x values (x<spin), there is an area under the Gaussian, but still outside the areaBox -- this area (a0) is clearly un-pickable by x,y pairs inside the areaBox.

Since area a0 touches the top of the areaBox, the area just below it (a1) has no x,y values that outside the areaBox, nor values above the Gaussian -- therefore these points are never to be rejected, and this area always returns without any curve computations (VERY fast.)

Immediately to the right of a1 (eg: x>spin and x<2spin) the Gaussian curve is inside the areaBox, and also divides it vertically into an accept and reject region. The accept region is a2, and the reject is anything above a2.

To remove the rejection area above a2 as much and as efficiently possible (speed is an issue), it is filled by simply rotating area a0 by 180 degrees into translated area a0r; (the rotation is around point x=spin, y=top_of_areaBox ).

Because a0's x domain is from 0 to spin, the area when rotated will start at x=spin and end at x=2*spin, the latter x always happens to fall short of the maximum x value of the areaBox -- which makes packing imperfect.

The point x=spin is an arbitrary constant, but it has two issues affecting what value is best; it must be less or equal to 1 and it determines the height of the areaBox. For maximum usage of the areaBox, spin needs to be as large as possible -- but there is a tail curve fitting trade off which makes "spin" values in the range range of 0.95 to 0.98 very desirable. Spin may never be greater than 1.0, because a Gaussian curve has an inflection point there -- and therefore greater values will cause a2 to overlap with a0r;

Once this x=spin is chosen, the two areas a2 and a0r (found in region x=spin to x=2spin), will still have a y oriented reject area between them called e0.
The requirement that spin<1 also guarantees the existence of a small un-packed area to the right of e0 which ends at the edge of the areaBox. This latter region is beneficially less than 5% of the x values in the areaBox.

Hence, by virtue of area conservation -- the rejection areas of e0 and of areaBox x>2spin, must exactly add up to the tail region found immediately after x=2spin and extending to infinity. This tail is troublesome to implement -- but I discovered something useful:

-------------------------------------------------------------------------------------------------------------
A very useful discovery regarding the problematic tail region and it's efficient generation...

If useless area e0 is mirrored across s=2spin, (as drawn), such that x values in e0 become mirrored x values in e0m -- it reduces the tail area to the cyan colored region; And this tail region, if graphed without the e0m below it -- happens to be almost exactly another Gaussian curve....

Recursively calling my version of the Monty Python random Gaussian generator, will produce this tail perfectly -- with all values of x to infinity "potentially" returned (The recursive Gaussian x values add).

I curve fit the value of "spin" and a scale factor (s) to transform the mirrored x values (of em0) so that they maximize a Gaussian fit to the tail area.
The worst case error is around 0.6% (see the cyan error plot which is in percent...).

Since the em0 correction only extends a distance of ~1 spin, the correction ends at around x=3*spin --- and the remainder of the equation reverts to the tail of the *original* Gaussian, therefore this discontinuity is the cause of the maximum error. However the max error occurs roughly after the, 3 sigma mark which means this error is only in about 3 of 1000 random numbers; (but as the call is recursive, the uncorrected errors compound *very* slightly...)

Notice, however, the tail portion corrected by em0 (eg: with 0.2% error remaining) is executed far more often (around 3% of the generator calls) and is far more serious to minimize error in. I don't have an exact estimate of the total displacing area of all errors -- but it isn't very big. I am thinking < 0.2% * 4% = 0.8% (and perhaps quite a bit less because of exactly where it occurs...)

The exact anti-derivative of the tail region for the corrected em0 portion, is listed at the end of the post. It is invalid after the em0 portion...

------------------------------------------------------------------
Question: How to transform a perfect Gaussian distribution into the desired almost Gaussian required for the tail region to be correct...

I know that given an area formula for a distribution (anti-derivative) -- one can use a method called inversion to produce random samples from that distribution using a single uniform random number; It simply requires plugging a uniform random number into the anti-function of the anti-derivative.

What I need is a distribution generator equal to the tail area in cyan -- it isn't an exact Gaussian -- but VERY close to one.
Since I know the anti-derivatives of the tail region, analytically, And I know the anti-derivative of the ideal Gaussian -- I think there must be a way to convert the Gaussian into my tail distribution.

How do I do this? A generic discussion of converting any distribution into any other via calculus integrals would be very helpful (it doesn't need to be this problem here, just a *clear example* of how this might be done.) Is anyone able to help?

-------------------------------------------------------------------------------------
Appendix: Information to reproduce my graphs and equations...

The graph attached was produced with the following equations using gnuplot...
They may (of course) be used in other graphing programs that understand the C like ternary operator (condition)?(evaluate when true):(evaluate when false)
Gnuplot also does not return errors on divide by zero, but does not plot those points. I find that useful to limit where an equation is displayed....

n(x)=exp(-0.5*x**2)
serf(x)=erf(2**-0.5*x)

t(x)=area/n(spin)
areaBox(x)=(x < t(spin) ) ? n(spin):0

error(x)=(x>=0) && (x<=spin) ? 2*n(spin) - n(x) -n(2*spin-x) : 0
rot180(x)=(x>spin) && (x<=2*spin) ? 2*n(spin)-n(2*spin-x) : error(x-2*spin)
residual(x,s)=(x>=0)?( n(x+2*spin) - error(x/s)*s ):1/0
tail(x)=residual(x,s)/residual(0,s)

Note: semantically, residual computes the corrected tail region -- tail(x) is a slight misnomer, it actually scales the tail region to 1.0 to ease comparing it to a normal curve.
The following constants were the ones I thought made the final tail fit to a Gaussian nearly optimal...

s=1.02803051369197 scales the mirrored e0m's x values in the tail -- this adds a degree of curve fitting freedom. The scaling preserves area...
spin=0.966804869573191 This was the best value of spin to make the tail region as Gaussian as possible... and keep e0 relatively small (tradeoff issues).
stdS=0.609743713386175 The ideal Gaussian to be matched against in the tail region has a standard deviation in x which isn't unity -- this estimates the actual.

The area of the tail function in the corrected em0 region only, ought to be (barring mistakes...):
$$\int n(x')-e_{0m}(x') = \sqrt{\pi \over 2}( serf(x+2spin) + serf(x) - serf(2spin-x) )-2x \times n(spin)$$

After this region, it simply reverts to the tail of the original Gaussian, and is just plain old erf() scaled and translated appropriately...

#### Attached Files:

• ###### gausian.png
File size:
22.2 KB
Views:
97
Last edited: Jun 16, 2012
8. Jun 16, 2012

### viraltux

Well, this is a good example that we cannot trust everything we read in the Internet :tongue:

This method is not faster but actually horribly slow, the RAND generation is actually one of the bottle necks in this kind of situations; generating a good random number is expensive so the ratio of RAND numbers per distribution number should ideally be 1/1, in some difficult situations you might settle for 2/1, but a ratio of 'dozen or so'/1 is by no means reasonable. The moment you read something like that there is no need to keep reading (or keep reading with a raised eyebrow). And not just that, in this approach speed is not the only problem, accuracy is terrible as well since it entirely relies on the Central Limit Theorem... so yeah.

9. Jun 16, 2012

### viraltux

Hi Andrew,

The two fastest approaches that I know of to generate random normal distributions are the inverse method and the Box-Muller. The Ziggurat method, though mathematically sound, when implemented in parallel computing you get an increasing divergence proportional to the number of threads; a good example for the wisdom "theory is better than practice in theory, but practice is better than theory in practice" :tongue:

So since you are trying to work out a Ziggurat like algorithm maybe you want to check this paper first (Section 5.4): http://www.mendeley.com/research/state-of-art-of-heterogeneous-computing/

The inversion method relies in a very efficient coding with machine accurate results of the inverse normal probability function. This is the method used by default in R (which IMHO says something about it)

Box-Muller is also good since you have a ratio RAND/Normal of 1/1 and there are specific machine develop methods (GPU, math co-processor, etc... ) to calculate sin, cos and exponentials of imaginary numbers very efficiently.

10. Jun 16, 2012

### andrewr

I read through the paper; The application I have is just basic experimentation -- and I am actually more interested in the analytical solutions which a mathematician would come up with than the software; My need for moderate software speed is just to do verification of what I come up with analytically... but I'd like to run tests that have around a trillion samples; that means with the generators I have available -- that takes close to a week of single CPU time.

I'd love to use Xilinx Vertix 5 chips -- but that is out of budget for a hobby project; and the older 4K series of xilinx chips are way too slow (I still have some...).
Also, as this is a one shot study -- buying GPU's is also a bit of a waste (I am here to keep in practice, more than to solve a problem with \$ attached.)

Yes, I am sure the inverse method is very accurate -- however, the inverse of erf() has no closed form -- so I am not sure how the polynomial curve fits would have been developed for the inverse function used in R that are machine accurate AND fast.
I would be interested in looking at such polynomials, and to experiment with them -- but I don't know how I would create them from scratch in any reasonable time frame.

Yes, but on Intel CPU's (which is a big portion of the audience here...) -- the Log function is often the slowest math operator possible on the math - co-processor enabled math library (gcc/visual studio C libraries.). Doing just a basic benchmark as I did in an earlier post is a sample of what I have come across. Intel does not appear to have a very efficient hardware implementation of ln() on most of their processors.

As for box-muller, The reference implementation I gave in the program at the start of the thread actually requires TWO RAND's for 1 result. Although some people tout the orthogonal nature of sin and cos to produce two results -- in practical applications, their relationship seriously skews the result of calculations... Also, it requires one log operation each run -- and on Intel that is very slow.

The article made some excellent points regarding random number generators needing to use the word size of the CPU/computer; I used byte for simplicity, and that has memory access (copying time) penalties which slow the random generator down significantly.

Since I am not using parallel computing -- I am not sure what the issue concerning divergence really is; but I would be interested in your comments -- is this a synchronization issue between thread outputs of random generation?

For my purposes, I want to learn about the calculus involved in warping one random distribution into another -- and figure out ways to test a Generator's overall accuracy.
I figure that if I keep a running list of 4 Gaussians made by my generator -- that linear combinations of these results will have *significantly* improved properties;
since linear operations are so fast as to have almost no penalty -- that is a practical solution to my generator's problem... but I'd like to go farther than the knee-jerk pragmatic solution just for knowledge's sake.

--Andrew.

11. Jun 16, 2012

### viraltux

I doubt very much they are using simple polynomials to approximate the inverse but, luckily, R is a FOSS project which means you can hack the code all you want: http://cran.r-project.org/sources.html

If you get skewed data you're doing something wrong, with Box-Muller two RAND give you two independent normally distributed values.

I haven't seen the details of the Ziggurat algorithm parallel implementation myself, but I don't think it is a sync issue but rather something related to the inner complexity of the Ziggurat algorithm itself. Maybe you can find more info in the paper's references.

Well, if you care more about the theory than the practice that puts Ziggurat like algorithms back on the table and the conversation shift entirely to the math side.

I don't think though the linear combination is a good idea. You're right that the linear combination has virtually no penalty but you have to think that you will have to generate 4 numbers to get only one with improved properties and that defeats the whole purpose; by doing so you are making your algorithm around 4 time slower that it should be.

Oh wow, I was going to recommend you to check the work of Mr. Marsaglia http://en.wikipedia.org/wiki/George_Marsaglia and I just found out he passed away last year!! Well... Long life the King. That guy actually was the man when it came to randomness and computing.

Anyway, there are lots of bibliography on this subject, so you will not get short of it.

You know, whatever the method you see around they always have two steps; firstly to generate random numbers from 0 to 1 and secondly to use those numbers to generate a particular distribution. Back when I was working with random numbers/generators I thought about the possibility to create an algorithm that would do both at the same time, that is, an algorithm that directly generates a Normal distribution.... Well, easier said than done but I still think that this approach might prove fruitful.

Anyway, have fun

Last edited: Jun 16, 2012
12. Jun 17, 2012

### andrewr

Excellent, I'll look there.

Yes and no. There is a relationship between them that they both always have the same "radii"; so that if one computes the square of the two "independent" variables and adds them, the correlation with the NON-Gaussian radius becomes part of the result. In most applications this is irrelevant -- but not always. In a true independent generation, the square of the results will not correlate with a non-Gaussian variable in any way. In typical applications, the correlation is only important if the two values are returned sequentially -- so if one simply delays the return randomly, the problem can be removed... but the log calculation is still very slow on an Intel chip.

Yes... I don't wish to get stuck forever on the random issue, but I do want to learn about the calculus of doing the transformations; That part goes beyond random number generation and into error propagation when Gaussian variables and measurement of data are operated on mathematically. Surprisingly, (for example) the multiply operation on two Gaussian variables does not produce a true Gaussian -- and the typical error propagation formula found in standard texts fails under certain conditions...

I could develop an emprical way of dealing with the issue -- but I'd prefer to be *able* to analyze it more robustly. Therefore I do need to understand how operations/polynomials/etc. transform a random distribution....
That's my ultimate goal.

I can see why you think that -- but there is an alternative; One can simply have an array of four numbers, and then subtract the first number off the last result and add a new final number randomly generated. The new result only depends on the history -- plus one new random number. Also -- Randomly flipping the sign of the result, or of two of the numbers in the sum can prevent any averaging effect to show up between adjacent results that are returned.

Sign flipping only consumes one random bit .. and extra bits can be saved between generator calls. etc. etc.

Just as a note; Some literature I have seen talks about using an array of "seed" Gaussian values and then doing vector operations on them to produce a derived set.
(SSE or Altivec type programs).

This second set is then, only partially sampled (some values thrown away) -- and the process repeated. It's used in some super computing platforms because vector operations are cheap -- and everything else *very* expensive....

Thanks, I'm sure that will help. I have heard his name.... but never saw the article !

Yes, that's about what I have noticed too... but without knowing how to do the calculus on transformations, I'm not going to be able to succeed there...

as for fun: :!!) I am... !!

13. Jun 17, 2012

### viraltux

I disagree on this one. If I understand you correctly you're saying you just found some kind of correlation in the Box-Muller procedure that breaks somehow the independence of the two variables, right? You prove that and you become the new King... Long Live the King

Why would this be surprising?

This linear relationships to get the next number are common in algorithms to generate RAND numbers, but we are talking about generating a Gaussian once you have the RAND and going recursive linear again would be an overkill, even assuming it works, which I can't see clearly it would.

Besides, sign flipping is not the issue, the issue would be to know when you have to flip the sign; for that you need another good random bit generator which, in turn, would just add more overhead to the process.

Then welcome to PF! I think we need some of that here :tongue:

Last edited: Jun 17, 2012
14. Jun 17, 2012

### haruspex

Fwiw, I've tried another approach.
The idea is to map the range (0,∞) down to (0,1):
Given target PDF f(x), map x to z as x = g(z), g(0)=0, g(1) = +∞, g monotonic on (0,1).
h(z) = f(g(z)).g'(z)
Thus h(z).dz = f(x).dx.
The procedure is then:
- Pick z, y from uniform distributions on (0,1), (0,K) respectively.
- If h(z) > y return g(z), else pick afresh.
We need the following to be true:
- h and g are speedily computed
- h(z) <= K for all z in (0,1) (so set K = this max)
- Integral of h over (0,1) (= I =∫f(x).dx for x > 0) is a largish fraction of K

E.g. g(z) = 1/(1-z) - 1
h(z) peaks at z = 0.5 with h(0.5) = 4.e-0.5, approx 2.43.
Integral of f(x) for x > 0 = sqrt(π/2). This makes the 'occupancy' I/K = 0.52. Not great, but perhaps good enough.
I tried it and it certainly does produces half a bell curve. The remaining problem is to flip half of the returned values to negative x.

15. Jun 19, 2012

### andrewr

haruspex,
That looks very interesting -- I'll see if I can code it (If I understand it correctly), and see how it does. Once I have it running -- I'd like to examine the math you did to get the result, too; and compare it against the monty python version I came up with.

As an aside, I did another one of those analytical problems that perform operations on Gaussians -- in this case I did N(0,1) + N(0,1) but with the proviso that *only* values with the same sign were added (100% Cyhelsky skew correlation...); The result was:

$\overline { \sigma }= \sqrt{ 2 + { 4 \over \pi } } \approx 1.809209646$

And that is correct, by numerical simulation also.

The pearson covariance for the same constant is just:
${ 2 \over \pi } \approx 0.636619772$

I am curious, this is a sort of auto-correlation that I haven't found referenced anywhere online -- but I have found it very useful (as an approximate decimal constant up until now -- since I couldn't solve it analytically before!) and there is only one author who seems to be doing work in the same area -- A Czech by the name of Cyhelsky; but his work isn't online except the reference at Wikipedia.

What would this constant be called, eg: I'm thinking it's the auto-correlation constant of Gaussian noise? but is there a systematic name for constants like this one?

No use inventing a name if one already exists....

16. Jun 21, 2012

### andrewr

A picture is worth....

As far as the Monty Python method, it is only slightly faster than the pipe-lined box muller method.... Raw uncorrected Test on 100million samples:

MPM = 37 secs
BM = 47 secs

The raw version, without tail correction -- has a small but visible defect in the tail.
Picture #1 (Which messes up calculations...)

When I do a 4 sum correction, as mentioned before -- the speed is slightly slower than the pipelined BM; but the tail errors do go away.
Picture #2

However, even in picutre #2 -- the effects of adjacent samples are still a problem. I would need to shuffle results so as not to output them sequentially.
eg: Neither BM or MPM are optimal solutions..

This time, I hit a snag of a different kind:
What I found is that the conditional statements ("if") and chains of computations are bad combinations resulting in CPU pipeline disruptions... *sigh*

Hence, The only real speedup possible is by improving the random generator; that can definitely make MPM faster than BM, but it isn't worth the time investment as the speedup isn't even 100%.

But I am beginning to think this may be a dead end ; The processes being discussed are highly CPU hardware sensitive...

I included your random generator in the computation code/text file, and it benchmarked at 149 seconds...

(Amazing, huh? I would have expected those simple equations to have done better,
but rejection always is very expensive time wise... ;) )

Thank you very much for the example, though, it does indeed produce a bell curve -- (exact) -- and the calculus you used to convert one mapping to another is more useful to me than the code in any event. It helps me understand what can/can't be done to transform equations.

#### Attached Files:

File size:
4.3 KB
Views:
84
File size:
4.4 KB
Views:
93
• ###### compute.txt
File size:
11.9 KB
Views:
85
Last edited: Jun 21, 2012
17. Jun 21, 2012

### andrewr

For the inversion method, they have source code called qnorm.c; and FYI in it they have several 5th degree polynomials (and higher!). The code would be exceedingly slow -- but yes, it would be accurate. They are using ratios of polynomials to produce the erf function; and not just erf of the normal(0,1), but of any std deviation and mu.

In R -- the actual random number generation code for rand_norm(); (snorm.c?),
has several methods besides inversion -- Kinderman's method from 1976, and Forsythe's method are included. I already have a later version of Kinderman's code (1977) with Monahan, instead, in my code. It is *much* shorter and just as accurate; but alas -- it's still slow.

Yes, I see some references to an issue -- but I don't quite understand it; perhaps when I have more time -- I'll get into that in more detail. The ziggaraut clearly has issues and is fairly complicated.

I'm wondering if there isn't a way to use sin() and cos() or perhaps a polynomial and sqrt() since these are far faster than Log, and might allow the BM method to be sped up significantly without much work.
Hmm....

Thanks again for the links, they were informative.

18. Jun 22, 2012

### viraltux

Well I said I doubted very much they use just polynomials cause the tails cannot be nicely approximated with them, so it seems R uses for the tails a ratio of polynomials which it is not a polynomial itself.

But I disagree that using polynomials makes the code "exceedingly slow" in fact polynomials are computationally very fast structures to calculate since you have a cost O(n-1) for nth degree polynomials. That means that most of the time the cost to calculate such function in R will be around O(4), which is great.

There are many performance techniques we can use to speed up algorithms, you can really do wonders if you're determine.

A family of these tricks are based on doing things like alter the algorithm in ways that mathematically makes the algorithm slower but taking into account the machine architecture actually it becomes faster. So for these techniques you have to know your machine.

Another family of techniques are based of breaking the code into pieces and then reuse calculations to avoid redundancy. For instance, if you need to calculate $sin(x)$ and $cos(x)$ (which yo do in BM) you do not want to actually use the functions $sin(x)$ and $cos(x)$, what you do is to break the code for $sin(x)$ and $cos(x)$ and merge it into a $sincos(x)$ function which will be twice as fast and, thus, will make your code much faster.

So it goes for the $log(x)$ function, for BM you have to remember that the $log(x)$ is only applied for numbers within (0,1] that means that using some $log(x)$ implementations is a waste of resources since higher orders of the polynomial adds virtually nothing to the accuracy. So let's say that $log(x)$ uses a polynomial of degree 10, since you know you will never calculate numbers higher than 1, you can create a function $log01(x)$ which only uses a polynomial of degree 5 as machine accurate as the $log(x)$ function but twice as fast.

So, I just explained a couple of tricks that can really speed up your code if you use them instead applying $sin(x)$ , $cos(x)$ and $log(x)$ right away.

And actually the fun goes ooooon... But you're fun, right? :tongue:

Last edited: Jun 22, 2012
19. Jun 26, 2012

### andrewr

Yes, you are undoubtedly correct. Exactly. I just didn't understand what you meant...

Not really .... In theory, yes -- but in practice, no.
"R" is using code from the 1960's-80's which was for CISC machines.
Cache latencies and piplining issues, were not yet part of the programming milieu;
Foss for example, has several *if* statements in the code deciding when to use which polynomial ... that's exactly what slowed my code for the MPM down horribly compared to my estimate of it's speed. (I'll explain later.)

On most modern machines, the brute force operations of addition and multiplication both have the same time delay, so it's more like 7 operations to do a 4th degree polynomial.

7 operations is about the same time as it takes to compute sin or cos to the full double precision on my machine.... But a 4th degree poly is nowhere near able to achieve the same precision.

If you look at the source code I already posted -- you'll notice that I use second and third degree polynomials to avoid computing exp(-0.5*x**2) as much as possible.
I use these polynomials > 98% of the time, compute exp() for the remainder.
I even did alignments for power of two operations where possible, since that doesn't even require a multiply instruction; but only an integer add...

For a particular number of repetitions -- I think it was around a billion -- I have time measurements for each of the math--co operations on my machine.

multiply and divide : 2 seconds (or less)
sin() and cos(): 15 seconds
sqrt(): 36 seconds.
exp(): 46 seconds.
log(): 87 seconds.

In theory, then, I expected my program to 98% of the time to do polynomials at a weight of *less* than 8 seconds. And 2% of the time, to run exp at 46+2+2 ~ 50 seconds;
That works out to <9 seconds average...
Box muller (ignoring multiplies) would be 87+36+15=138 seconds.
(Which isn't far off from reality...)

Hence, the MPM method ought to be on the Order of 15 times fasterthan BM.
But it's nowhere *NEAR* that much faster -- it is, in fact, only *slightly* faster.

If anyone sees a coding mistake, let me know -- but I don't think there is anything wrong with the way I wrote the program.

The main issue for my machine, and these issues are *VERY* common in all desktop machines whether powerPC, or Intel, and even some ARM processors...
Is that failed branch prediction must break a very deep pipeline in the processors.
On an Intel Celeron, the pipeline depth is around 40. That means, every time an "if" statement changes the flow of the machine -- the CPU runs around 40 times slower than normal. A single *if* statement can be very expensive, time wise. A 2.5GHz processor is reduced to 62MHz for a brief moment... There is no way to program preferred branch directions, either, in the C language. It's all up the CPU's automatic discretion.... (if it has any bug free ability at all...).

The main thing I need to do is calculate a single exact formula to produce the Gaussian -- with no *if* statements included in the algorithm. These techniques you mention are all very good (I'm an EE, and although I do hardware -- I'm *very* familiar with these issues.)

Yes, I'm into fun. Lots of it.
I really do want to stay away from machine architecture dependent coding.
That means I can't achieve the "highest" performance -- but it also means I can share the code with anyone; I know from hardware design experience that the relative timing of my machine's math co. hold for many other processors as well.
Sin and Cos don't appear to be implementable by any method I know which will be faster using polynomials, than just using the built-in function. Log(), however, could easily be replaced -- and I have some experimental (non-polynomial) formulas that might be well worth curve fitting....

I'm learning a bunch about how a uniform random plugged into a formula produces a pdf ! eg: f( u1 ) --> a pdf of --> d/dx f^-1( x ).
This appears to be key in designing a more easily computed Gaussian generator.

There is much that I am studying in BoxMuller, some of which is just for knowledge -- other parts explorations for speed-up.

One fairly interesting path is to consider sin and cos: Since a function expressed as sin() is at 90 degrees to one expressed with cos(); I was thinking it may be possible to use this relationship to compute an anti-function of something expressed as sin(f(x)).

exp(-0.5*x**2) is slow to compute; but when I graph acos( exp( -0.5*x**2 ) ), The resulting graph appears to be a hyperbola of some kind that would be *VERY* easily computed using only multiplication, division, and addition. Hence, it would be around 2.5 times faster than computing the exponential itself, to compute the cosine of a different function... (And it may be *MUCH* faster than computing the anti-exponential directly as BoxMuller does...!)

Does anyone happen to know what the formula of the hyperbola is for acos( exp(...) ) ?

20. Jun 26, 2012

### haruspex

Doesn't look like that to me. There's a right-angle at the origin (gradient -1 to the left, +1 to the right), and an asymptote at y = π/2.
I guess if you look at x > 0 only, it does resemble a hyperbola. Taking the gradient near the origin to represent the other asymptote, the angle between asymptotes on the same branch is 3π/4. So the hyperbola should look like y2 = tan2(3π/8)x2 - c, rotated clockwise by 3π/8 and suitably translated.