Expectation value for CDF fails

andrewr · Jun 4, 2012

Hi,

(Sorry for the slight misnomer in the title... I can't edit it!)

I'm doing several problems to compute the expectation value and variances of sub-samples & operations on the normal distribution; and I am having trouble getting results that agree with numerical simulations.

I have several Gaussian random generators written in C, and am using vectors of length 1million to 10billion to do numerical experiments; all generators appear to converge accurately to at least 4 decimal places with the sample sizes I am using. For example in μ(σ) notation, adding 0(1) + 0(1) gives 0(1.414), exactly as expected.

The random generator is crude, but replacement of it doesn't change the result.
Program is attached...

So, for my first (trivial) problem (and hopefully the same mistake is in my other questions!):
When I do Ʃ |0(1)|/n, I get 0.8933.

However, if I try to solve the latter problem analytically, I get a different result.
Could someone point out where I have made a mistake, please?? :)

I think the pdf for a Gaussian is:
[tex]p(x)= \sqrt{1 \over 2\pi}e^{-{1\over2}x^2}[/tex]

Since abs() only uses Half the Gaussian curve, the pdf ( w/ domain [0,∞) ) becomes:
[tex]p(x)= 2\sqrt{1 \over 2\pi}e^{-{1\over2}x^2}[/tex]

To compute the expectation value, all I ought to do is solve:
[tex]\overline{X}=\int\limits_0^{∞}x \times p(x)[/tex]

[tex]\overline{X}=2\sqrt{1\over2\pi}(\left.-e^{-0.5x^2}\right|_0^{∞})=2\sqrt{1\over2\pi}≈0.7979[/tex]

Which is, of course, wrong...
So, what did I do that messed the result up?

haruspex · Jun 5, 2012

I believe your algebra is right and your program wrong.
I set up a spreadsheet using RAND() and the Box-Muller method to generate sets of 600 random values with N(0,1) distribution. The averages of the absolute values hovered around .75 to .85.

andrewr · Jun 5, 2012

haruspex said:

I believe your algebra is right and your program wrong.
I set up a spreadsheet using RAND() and the Box-Muller method to generate sets of 600 random values with N(0,1) distribution. The averages of the absolute values hovered around .75 to .85.

Hmm...
I just wrote a small test program in Python to double check, as it uses a ratio of uniform [0,1) randoms to generate a Gaussian -- and it does indeed come up with 0.7978 for 10Million vectors.
So it does have to be my program.

Thanks.

Edit:
Ohhhh! I just found it, I was still taking the root of the answer and computing the standard deviation -- not the mean/expectation...
Sigh... Thanks again, and very much!

--Andrew.

andrewr · Jun 10, 2012

Unfortunately, my other mistakes are not in fact related...

Side note: For those downloading the program out of curiosity, there are bugs in it. (The rejection method needs to be multiplied by 1/sqrt(2 pi). etc.

I have found & written much better Gaussian random number generators; and I'd rather share good code, than leave broken code around in the near future: I am attempting 1 trillion computation calculation experiments, now, and it is becoming more important to SEIRIOUSLY optimize the generator for speed AND accuracy. I'm sure some of you would like to do the same -- but easily... I'd like to help.

To my surprise: The Box muller method really isn't that fast! although it is reasonably accurate. Computing the logarithm function appears to be the slowest possible math function for doing Gaussian generation -- and (to my surprise) almost all *accurate* random number generators use it on every computation.

I have come up with a much better method (and of course, found out someone else discovered it before me...) but I am re-writing it to avoid the logarithm function as much as possible. This gives a 10x speed improvement on my machine -- so it IS worth chasing after...

On to my mistake ... This one has to be analytical:
I am attempting to compute the standard deviation of two N(0,1) distributions added together -- and the answer is sqrt(2). I'm not sure where I lost the constant, but I get sqrt(4) -- and I am thinking, maybe when I did a change of variables -- I forgot something. I just looked it up, and I am tired enough -- maybe I am just blind to what step I mis-copied.
The probability of each data element is:
[itex]p(x)={1 \over k}(e^{-{1\over2}x^2}e^{-{1\over2}y^2})[/itex]
[itex]k=\int \limits_{- \infty }^{ \infty } \int \limits_{ - \infty }^{ \infty } e^{-{1 \over 2}(x^2+y^2)}dxdy=\pi[/itex]

The variance for addition is:
[itex]\sigma^2(x,y)=(x+y)^2[/itex]

Which makes the weighted variance:
[itex]p(x,y) \sigma^2(x,y) = {1 \over \pi}(x+y)^2e^{-{1 \over 2}x^2}e^{-{1 \over 2}y^2}={1 \over \pi }(x+y)^2e^{-{1 \over 2}(x^2+y^2)}[/itex]

So the combined equation yields:
[itex]\overline{ \sigma ^2 }= \int \limits_{- \infty }^{ \infty } \int \limits_{ - \infty }^{ \infty }{1 \over \pi }(x+y)^2e^{-{1 \over 2}(x^2+y^2)}dxdy[/itex]

And I think this is where I might have messed up? I needed a change of coordinate systems/variables...
[itex]\overline{ \sigma ^2 }= \int \limits_{ 0 }^{ 2 \pi } \int \limits_{ 0 }^{ \infty }{1 \over \pi }(r^2+2r \times cos( \theta ) \times r \times sin( \theta ) )e^{-{1 \over 2}r^2}r \times dr d \theta[/itex]
[itex]\overline{ \sigma ^2 }= {1 \over \pi }\int \limits_{ 0 }^{ \infty }\int \limits_{ 0 }^{ 2 \pi } (1+ 2cos( \theta ) sin( \theta ) )r^3e^{-{1 \over 2}r^2} \times d \theta dr[/itex]
[itex]\overline{ \sigma ^2 }= {1 \over \pi }\int \limits_{ 0 }^{ \infty } \left. ( \theta - {cos^2( \theta )} ) \right|_{ 0 }^{ 2 \pi } r^3e^{-{1 \over 2}r^2} \times dr[/itex]
[itex]\overline{ \sigma ^2 }= {1 \over \pi }\int \limits_{ 0 }^{ \infty } 2 \pi \times r^3e^{-{1 \over 2}r^2} \times dr[/itex]
[itex]\overline { \sigma ^2 }= 2 \left. ( -e^{-{1 \over 2} r^2} (r^2+2)) \right|_{ 0 }^{ \infty }[/itex]
[itex]\overline { \sigma ^2 }= 4[/itex]

Sigh, It must be simple -- but I don't see it...

haruspex · Jun 10, 2012

[itex]k=\int \limits_{- \infty }^{ \infty } \int \limits_{ - \infty }^{ \infty } e^{-{1 \over 2}(x^2+y^2)}dxdy=2\pi[/itex]
Wrt methods for generating N(0,1) values from the uniform distribution of RAND, I've read that simply averaging batches of a dozen or so gives good enough results for most purposes, and much faster. Haven't tested that.

andrewr · Jun 10, 2012

haruspex said:

[itex]k=\int \limits_{- \infty }^{ \infty } \int \limits_{ - \infty }^{ \infty } e^{-{1 \over 2}(x^2+y^2)}dxdy=2\pi[/itex]
Wrt methods for generating N(0,1) values from the uniform distribution of RAND, I've read that simply averaging batches of a dozen or so gives good enough results for most purposes, and much faster. Haven't tested that.

I copied that result directly from a table... Figures, I didn't notice that it was an integral of exp( x^2 + y^2 ) and not 1/2 of that...
Thanks for releasing me from mathematical hell...

Regarding the generation of N(0,1);
I have experimented with that, and I'll point out some issues -- and future directions I plan on testing...

When I benchmark general operations inserted in measurable places within the program I submitted, and including any penalities from OS exceptions, etc. I come up with the following approximate times for 1 billion iterations.

*, or + of doubles = indistinguishable from 1 or 2 seconds.
exp() --> 46 seconds
sin(),cos() --> 15 seconds.
sqrt() --> 36 seconds.
Rand01() --> 60 seconds.
log() --> 1 minute 27 seconds.

Generation of decent random numbers isn't a really fast operation; I improved the generator I posted before to use many less operations, to fit in a CPU cache (well), remove IO and Unix dependencies... and it still wasn't fast...

To average 12 of them would be far slower than using the log() function...
I plan on exploring the Merzenine (sp?) twister core generator to see if it is much faster, but I doubt it.

I tried a three sum RAND() generator, and it produced significant errors; and was already slow.

There are a handful of Gaussian generation techniques that I find in the literature...
Trimming by rejection of either a pure Rand function, or a distorted version of it;
Addition of normalized (standard deviation scaled) RAND numbers -- because sums of Gaussians are also Gaussians...
Piecewise linear selection of random or "binned" values (Ziggaraut method), and several other hybrids...
And transformation based generators.

wrt: summing RAND()'s, I think it the addition of normalized RAND's might be fast if applied to say, a triangle distribution, by summing two RAND() values. That is something I will probably try.

The transformation class of generators is the one I believe can be made be the fastest. I haven't figured out how to code the Ziggaraut, but I would like to try it -- for that is the only serious contender that I can see by estimation of the operations used.

The main problem is that CPU Cache times are non-intuitive on the newer processors, and it isn't trivial to find out what is causing slowdowns on memory access bound routines. The zigaraut method relies on tables in memory, just as my random generator does. This is typically a blazingly fast method of computation -- but to my surprise, it really isn't in the routines I bench-marked. So, although some sites claim the Ziggaraut to be the fastest routine -- I am wondering if that is a hand optimized version, or a generic version. eg: I am not wanting to produce CPU specific code, so I would rather test code that is only written at the C level.

I found a generator very close to the idea I have for a transformation -- it is called (of all things, as I like Python...) the Monty Python method...

But the simplified description makes it clear that it too uses either the log function or the exp() function for a very large percentage of the operations. (No actual code was available in the descriptions I read.)

In my idea, I can reduce that percentage massively -- so that only 5% or less of the samples will use the exp() function -- (and perhaps twice...); so that I hope to target the add/multiply at two seconds with 6% of the exp time as a penalty (around 3-4 seconds.) I figure that's the best I am capable of coding with what I know...

We'll find out soon!

andrewr · Jun 15, 2012

Let the real fun begin -- I got stuck, (of course), in the tail region of the Gaussian random generator I am making. It is going to be quite fast and accurate regardless of whether I fix a small error in the tail region, but I have a question about how to correct it in a certain way...

I'm attempting to make this post readable by breaking it up into chapters. I also attached a graph to this post which I will refer to constantly -- and unless you're blind, it will be extremely helpful explaining obscure sentences... (I did my best to be clear, even if you are blind... ;) )

The question itself can be found by looking at the chapter "-------" headings.

-------------------------------------------------------------
General/generic background

All generators of random distributions produce random numbers in proportion to the differential area under the curve of choice. So, in principle, one can locate a point "inside" the distribution by choosing two uniform random numbers -- one for the y-axis and one for the x axis, and then seeing if the y value is "under" the curve. (AKA the x,y pair is found in the area under the curve.)

This method forces every point in the randomized uniform rectangle to have equal likelihood -- and since the number of points ( any y < curve's_y(x) ) making up the vertical line under the curve (at each x) is equal to the probability of choosing that x value by random chance, that's how often one should accept an x value that was uniformly chosen at random.

Therefore, all x,y based generators accept an x value *iff* a random y value is "under" the curve's y(x) value.

The easiest generation of the probability distribution, is to randomly select an x,y pair and throw away (reject) values until an x,y pair is found *inside* the area of the distribution curve. (AKA found Under the curve.) This final value is the only one returned.

This method has one unfortunate consequence when doing Gaussian generation: The number of rejects is *very* large.

Since the reject area outside a Gaussian grows rapidly as x values go beyond the 3 sigma mark (eg: the y of the curve becomes tiny) -- that means all practical Gaussian rejection generators must truncate the maximum x value they will randomly choose (I used 8 sigma as the limit in my last program...). This of course introduces error, but keeps speed of the program from being infinitely slow in picking a number...

The Monty Python method of generating the Gaussian distribution uses the x,y technique -- but attempts drastically reduce the rejection area outside the Gaussian distribution (eg: any random_y(x) > Gaussian_y(x) points).

The way the M.P,M does this, is to choose a fixed randomizer x,y rectangle which doesn't enclose the entire Gaussian curve's y values; Rather the rectangle excludes regions where acceptable y values exist, and hence can never be picked. BUT those excluded areas are re-mapped into areas inside the randomizer rectangle which are rejected otherwise. Hence reject points for one part of the Gaussian become acceptable points in another part -- only re-mapping of the x value is required.

Even then, however, compromises are required for computation speed since not all areas can be mapped into the box efficiently by simple translations and rotations. These compromises inevitably leave some reject areas inside the randomizer rectangle.

But, *if* the area of the randomizer rectangle is made exactly equal to the total area enclosed under the Gaussian -- then any reject area *inside* the x,y rectangle must exactly equal all areas not packed inside that rectangle and generally speaking, this happens to be the "tail" area previously mentioned...

----------------------------------------------------------
How I chose to partition the Gaussian for a M.P.M approach.

In the graph, there is a blue rectangle which has the same total area as the Gaussian curve it maps; call this the areaBox; now notice at all small x values (x<spin), there is an area under the Gaussian, but still outside the areaBox -- this area (a0) is clearly un-pickable by x,y pairs inside the areaBox.

Since area a0 touches the top of the areaBox, the area just below it (a1) has no x,y values that outside the areaBox, nor values above the Gaussian -- therefore these points are never to be rejected, and this area always returns without any curve computations (VERY fast.)

Immediately to the right of a1 (eg: x>spin and x<2spin) the Gaussian curve is inside the areaBox, and also divides it vertically into an accept and reject region. The accept region is a2, and the reject is anything above a2.

To remove the rejection area above a2 as much and as efficiently possible (speed is an issue), it is filled by simply rotating area a0 by 180 degrees into translated area a0r; (the rotation is around point x=spin, y=top_of_areaBox ).

Because a0's x domain is from 0 to spin, the area when rotated will start at x=spin and end at x=2*spin, the latter x always happens to fall short of the maximum x value of the areaBox -- which makes packing imperfect.

The point x=spin is an arbitrary constant, but it has two issues affecting what value is best; it must be less or equal to 1 and it determines the height of the areaBox. For maximum usage of the areaBox, spin needs to be as large as possible -- but there is a tail curve fitting trade off which makes "spin" values in the range range of 0.95 to 0.98 very desirable. Spin may never be greater than 1.0, because a Gaussian curve has an inflection point there -- and therefore greater values will cause a2 to overlap with a0r;

Once this x=spin is chosen, the two areas a2 and a0r (found in region x=spin to x=2spin), will still have a y oriented reject area between them called e0.
The requirement that spin<1 also guarantees the existence of a small un-packed area to the right of e0 which ends at the edge of the areaBox. This latter region is beneficially less than 5% of the x values in the areaBox.

Hence, by virtue of area conservation -- the rejection areas of e0 and of areaBox x>2spin, must exactly add up to the tail region found immediately after x=2spin and extending to infinity. This tail is troublesome to implement -- but I discovered something useful:

-------------------------------------------------------------------------------------------------------------
A very useful discovery regarding the problematic tail region and it's efficient generation...

If useless area e0 is mirrored across s=2spin, (as drawn), such that x values in e0 become mirrored x values in e0m -- it reduces the tail area to the cyan colored region; And this tail region, if graphed without the e0m below it -- happens to be almost exactly another Gaussian curve...

Recursively calling my version of the Monty Python random Gaussian generator, will produce this tail perfectly -- with all values of x to infinity "potentially" returned (The recursive Gaussian x values add).

I curve fit the value of "spin" and a scale factor (s) to transform the mirrored x values (of em0) so that they maximize a Gaussian fit to the tail area.
The worst case error is around 0.6% (see the cyan error plot which is in percent...).

Since the em0 correction only extends a distance of ~1 spin, the correction ends at around x=3*spin --- and the remainder of the equation reverts to the tail of the *original* Gaussian, therefore this discontinuity is the cause of the maximum error. However the max error occurs roughly after the, 3 sigma mark which means this error is only in about 3 of 1000 random numbers; (but as the call is recursive, the uncorrected errors compound *very* slightly...)

Notice, however, the tail portion corrected by em0 (eg: with 0.2% error remaining) is executed far more often (around 3% of the generator calls) and is far more serious to minimize error in. I don't have an exact estimate of the total displacing area of all errors -- but it isn't very big. I am thinking < 0.2% * 4% = 0.8% (and perhaps quite a bit less because of exactly where it occurs...)

The exact anti-derivative of the tail region for the corrected em0 portion, is listed at the end of the post. It is invalid after the em0 portion...

------------------------------------------------------------------
Question: How to transform a perfect Gaussian distribution into the desired almost Gaussian required for the tail region to be correct...

I know that given an area formula for a distribution (anti-derivative) -- one can use a method called inversion to produce random samples from that distribution using a single uniform random number; It simply requires plugging a uniform random number into the anti-function of the anti-derivative.

What I need is a distribution generator equal to the tail area in cyan -- it isn't an exact Gaussian -- but VERY close to one.
Since I know the anti-derivatives of the tail region, analytically, And I know the anti-derivative of the ideal Gaussian -- I think there must be a way to convert the Gaussian into my tail distribution.

How do I do this? A generic discussion of converting any distribution into any other via calculus integrals would be very helpful (it doesn't need to be this problem here, just a *clear example* of how this might be done.) Is anyone able to help?

-------------------------------------------------------------------------------------
Appendix: Information to reproduce my graphs and equations...

The graph attached was produced with the following equations using gnuplot...
They may (of course) be used in other graphing programs that understand the C like ternary operator (condition)?(evaluate when true):(evaluate when false)
Gnuplot also does not return errors on divide by zero, but does not plot those points. I find that useful to limit where an equation is displayed...

n(x)=exp(-0.5*x**2)
serf(x)=erf(2**-0.5*x)

t(x)=area/n(spin)
areaBox(x)=(x < t(spin) ) ? n(spin):0

error(x)=(x>=0) && (x<=spin) ? 2*n(spin) - n(x) -n(2*spin-x) : 0
rot180(x)=(x>spin) && (x<=2*spin) ? 2*n(spin)-n(2*spin-x) : error(x-2*spin)
residual(x,s)=(x>=0)?( n(x+2*spin) - error(x/s)*s ):1/0
tail(x)=residual(x,s)/residual(0,s)

Note: semantically, residual computes the corrected tail region -- tail(x) is a slight misnomer, it actually scales the tail region to 1.0 to ease comparing it to a normal curve.
The following constants were the ones I thought made the final tail fit to a Gaussian nearly optimal...

s=1.02803051369197 scales the mirrored e0m's x values in the tail -- this adds a degree of curve fitting freedom. The scaling preserves area...
spin=0.966804869573191 This was the best value of spin to make the tail region as Gaussian as possible... and keep e0 relatively small (tradeoff issues).
stdS=0.609743713386175 The ideal Gaussian to be matched against in the tail region has a standard deviation in x which isn't unity -- this estimates the actual.

The area of the tail function in the corrected em0 region only, ought to be (barring mistakes...):
[tex]\int n(x')-e_{0m}(x') = \sqrt{\pi \over 2}( serf(x+2spin) + serf(x) - serf(2spin-x) )-2x \times n(spin)[/tex]

After this region, it simply reverts to the tail of the original Gaussian, and is just plain old erf() scaled and translated appropriately...

viraltux · Jun 16, 2012

haruspex said:

[itex]k=\int \limits_{- \infty }^{ \infty } \int \limits_{ - \infty }^{ \infty } e^{-{1 \over 2}(x^2+y^2)}dxdy=2\pi[/itex]
Wrt methods for generating N(0,1) values from the uniform distribution of RAND, I've read that simply averaging batches of a dozen or so gives good enough results for most purposes, and much faster. Haven't tested that.

Well, this is a good example that we cannot trust everything we read in the Internet

This method is not faster but actually horribly slow, the RAND generation is actually one of the bottle necks in this kind of situations; generating a good random number is expensive so the ratio of RAND numbers per distribution number should ideally be 1/1, in some difficult situations you might settle for 2/1, but a ratio of 'dozen or so'/1 is by no means reasonable. The moment you read something like that there is no need to keep reading (or keep reading with a raised eyebrow). And not just that, in this approach speed is not the only problem, accuracy is terrible as well since it entirely relies on the Central Limit Theorem... so yeah.

viraltux · Jun 16, 2012

andrewr said:

Question: How to transform one distribution into a very slightly disturbed distribution in order to correct the error...

I know that given an area formula of a distribution (anti-derivative) -- one can use a method called inversion to produce random samples from that distribution using a single uniform random number; It simply requires plugging the uniform random into the anti-function of the anti-derivative.

What I need is a distribution equal to the tail area in cyan -- it isn't an exact Gaussian -- but VERY close to one.
Since I know the anti-derivatives of the tail region, analytically, And I know the anti-derivative of the ideal Gaussian -- I think there must be a way to convert the Gaussian into my tail distribution.

How do I do this? A generic discussion with a simple worked example would be very helpful (it doesn't need to be this problem here, just a clear example of how this might be done.) Is anyone able to help?

Hi Andrew,

The two fastest approaches that I know of to generate random normal distributions are the inverse method and the Box-Muller. The Ziggurat method, though mathematically sound, when implemented in parallel computing you get an increasing divergence proportional to the number of threads; a good example for the wisdom "theory is better than practice in theory, but practice is better than theory in practice"

So since you are trying to work out a Ziggurat like algorithm maybe you want to check this paper first (Section 5.4): http://www.mendeley.com/research/state-of-art-of-heterogeneous-computing/

The inversion method relies in a very efficient coding with machine accurate results of the inverse normal probability function. This is the method used by default in R (which IMHO says something about it)

Box-Muller is also good since you have a ratio RAND/Normal of 1/1 and there are specific machine develop methods (GPU, math co-processor, etc... ) to calculate sin, cos and exponentials of imaginary numbers very efficiently.

andrewr · Jun 16, 2012

viraltux said:

Hi Andrew,

The two fastest approaches that I know of to generate random normal distributions are the inverse method and the Box-Muller. The Ziggurat method, though mathematically sound, when implemented in parallel computing you get an increasing divergence proportional to the number of threads; a good example for the wisdom "theory is better than practice in theory, but practice is better than theory in practice"

So since you are trying to work out a Ziggurat like algorithm maybe you want to check this paper first (Section 5.4): http://www.mendeley.com/research/state-of-art-of-heterogeneous-computing/

I read through the paper; The application I have is just basic experimentation -- and I am actually more interested in the analytical solutions which a mathematician would come up with than the software; My need for moderate software speed is just to do verification of what I come up with analytically... but I'd like to run tests that have around a trillion samples; that means with the generators I have available -- that takes close to a week of single CPU time.

I'd love to use Xilinx Vertix 5 chips -- but that is out of budget for a hobby project; and the older 4K series of xilinx chips are way too slow (I still have some...).
Also, as this is a one shot study -- buying GPU's is also a bit of a waste (I am here to keep in practice, more than to solve a problem with $ attached.)

The inversion method relies in a very efficient coding with machine accurate results of the inverse normal probability function. This is the method used by default in R (which IMHO says something about it)

Yes, I am sure the inverse method is very accurate -- however, the inverse of erf() has no closed form -- so I am not sure how the polynomial curve fits would have been developed for the inverse function used in R that are machine accurate AND fast.
I would be interested in looking at such polynomials, and to experiment with them -- but I don't know how I would create them from scratch in any reasonable time frame.

Box-Muller is also good since you have a ratio RAND/Normal of 1/1 and there are specific machine develop methods (GPU, math co-processor, etc... ) to calculate sin, cos and exponentials of imaginary numbers very efficiently.

Yes, but on Intel CPU's (which is a big portion of the audience here...) -- the Log function is often the slowest math operator possible on the math - co-processor enabled math library (gcc/visual studio C libraries.). Doing just a basic benchmark as I did in an earlier post is a sample of what I have come across. Intel does not appear to have a very efficient hardware implementation of ln() on most of their processors.

As for box-muller, The reference implementation I gave in the program at the start of the thread actually requires TWO RAND's for 1 result. Although some people tout the orthogonal nature of sin and cos to produce two results -- in practical applications, their relationship seriously skews the result of calculations... Also, it requires one log operation each run -- and on Intel that is very slow.

The article made some excellent points regarding random number generators needing to use the word size of the CPU/computer; I used byte for simplicity, and that has memory access (copying time) penalties which slow the random generator down significantly.

Since I am not using parallel computing -- I am not sure what the issue concerning divergence really is; but I would be interested in your comments -- is this a synchronization issue between thread outputs of random generation?

For my purposes, I want to learn about the calculus involved in warping one random distribution into another -- and figure out ways to test a Generator's overall accuracy.
I figure that if I keep a running list of 4 Gaussians made by my generator -- that linear combinations of these results will have *significantly* improved properties;
since linear operations are so fast as to have almost no penalty -- that is a practical solution to my generator's problem... but I'd like to go farther than the knee-jerk pragmatic solution just for knowledge's sake.

Thanks for the link,
--Andrew.

viraltux · Jun 16, 2012

andrewr said:

Yes, I am sure the inverse method is very accurate -- however, the inverse of erf() has no closed form -- so I am not sure how the polynomial curve fits would have been developed for the inverse function used in R that are machine accurate AND fast.
I would be interested in looking at such polynomials, and to experiment with them -- but I don't know how I would create them from scratch in any reasonable time frame.

I doubt very much they are using simple polynomials to approximate the inverse but, luckily, R is a FOSS project which means you can hack the code all you want: http://cran.r-project.org/sources.html

andrewr said:

As for box-muller, The reference implementation I gave in the program at the start of the thread actually requires TWO RAND's for 1 result. Although some people tout the orthogonal nature of sin and cos to produce two results -- in practical applications, their relationship seriously skews the result of calculations... Also, it requires one log operation each run -- and on Intel that is very slow.

If you get skewed data you're doing something wrong, with Box-Muller two RAND give you two independent normally distributed values.

andrewr said:

Since I am not using parallel computing -- I am not sure what the issue concerning divergence really is; but I would be interested in your comments -- is this a synchronization issue between thread outputs of random generation?

I haven't seen the details of the Ziggurat algorithm parallel implementation myself, but I don't think it is a sync issue but rather something related to the inner complexity of the Ziggurat algorithm itself. Maybe you can find more info in the paper's references.

andrewr said:

For my purposes, I want to learn about the calculus involved in warping one random distribution into another -- and figure out ways to test a Generator's overall accuracy.
I figure that if I keep a running list of 4 Gaussians made by my generator -- that linear combinations of these results will have *significantly* improved properties;
since linear operations are so fast as to have almost no penalty -- that is a practical solution to my generator's problem... but I'd like to go farther than the knee-jerk pragmatic solution just for knowledge's sake.

Thanks for the link,
--Andrew.

Well, if you care more about the theory than the practice that puts Ziggurat like algorithms back on the table and the conversation shift entirely to the math side.

I don't think though the linear combination is a good idea. You're right that the linear combination has virtually no penalty but you have to think that you will have to generate 4 numbers to get only one with improved properties and that defeats the whole purpose; by doing so you are making your algorithm around 4 time slower that it should be.

Oh wow, I was going to recommend you to check the work of Mr. Marsaglia http://en.wikipedia.org/wiki/George_Marsaglia and I just found out he passed away last year! Well... Long life the King. That guy actually was the man when it came to randomness and computing.

Anyway, there are lots of bibliography on this subject, so you will not get short of it.

You know, whatever the method you see around they always have two steps; firstly to generate random numbers from 0 to 1 and secondly to use those numbers to generate a particular distribution. Back when I was working with random numbers/generators I thought about the possibility to create an algorithm that would do both at the same time, that is, an algorithm that directly generates a Normal distribution... Well, easier said than done but I still think that this approach might prove fruitful.

Anyway, have fun

andrewr · Jun 17, 2012

viraltux said:

I doubt very much they are using simple polynomials to approximate the inverse but, luckily, R is a FOSS project which means you can hack the code all you want: http://cran.r-project.org/sources.html

Excellent, I'll look there.

If you get skewed data you're doing something wrong, with Box-Muller two RAND give you two independent normally distributed values.

Yes and no. There is a relationship between them that they both always have the same "radii"; so that if one computes the square of the two "independent" variables and adds them, the correlation with the NON-Gaussian radius becomes part of the result. In most applications this is irrelevant -- but not always. In a true independent generation, the square of the results will not correlate with a non-Gaussian variable in any way. In typical applications, the correlation is only important if the two values are returned sequentially -- so if one simply delays the return randomly, the problem can be removed... but the log calculation is still very slow on an Intel chip.

Well, if you care more about the theory than the practice that puts Ziggurat like algorithms back on the table and the conversation shift entirely to the math side.

Yes... I don't wish to get stuck forever on the random issue, but I do want to learn about the calculus of doing the transformations; That part goes beyond random number generation and into error propagation when Gaussian variables and measurement of data are operated on mathematically. Surprisingly, (for example) the multiply operation on two Gaussian variables does not produce a true Gaussian -- and the typical error propagation formula found in standard texts fails under certain conditions...

I could develop an emprical way of dealing with the issue -- but I'd prefer to be *able* to analyze it more robustly. Therefore I do need to understand how operations/polynomials/etc. transform a random distribution...
That's my ultimate goal.

I don't think though the linear combination is a good idea. You're right that the linear combination has virtually no penalty but you have to think that you will have to generate 4 numbers to get only one with improved properties and that defeats the whole purpose; by doing so you are making your algorithm around 4 time slower that it should be.

I can see why you think that -- but there is an alternative; One can simply have an array of four numbers, and then subtract the first number off the last result and add a new final number randomly generated. The new result only depends on the history -- plus one new random number. Also -- Randomly flipping the sign of the result, or of two of the numbers in the sum can prevent any averaging effect to show up between adjacent results that are returned.

Sign flipping only consumes one random bit .. and extra bits can be saved between generator calls. etc. etc.

Just as a note; Some literature I have seen talks about using an array of "seed" Gaussian values and then doing vector operations on them to produce a derived set.
(SSE or Altivec type programs).

This second set is then, only partially sampled (some values thrown away) -- and the process repeated. It's used in some super computing platforms because vector operations are cheap -- and everything else *very* expensive...

Oh wow, I was going to recommend you to check the work of Mr. Marsaglia http://en.wikipedia.org/wiki/George_Marsaglia and I just found out he passed away last year! Well... Long life the King. That guy actually was the man when it came to randomness and computing.

Anyway, there are lots of bibliography on this subject, so you will not get short of it.

Thanks, I'm sure that will help. I have heard his name... but never saw the article !

You know, whatever the method you see around they always have two steps; firstly to generate random numbers from 0 to 1 and secondly to use those numbers to generate a particular distribution. Back when I was working with random numbers/generators I thought about the possibility to create an algorithm that would do both at the same time, that is, an algorithm that directly generates a Normal distribution... Well, easier said than done but I still think that this approach might prove fruitful.

Anyway, have fun

Yes, that's about what I have noticed too... but without knowing how to do the calculus on transformations, I'm not going to be able to succeed there...

as for fun: :!) I am... !

viraltux · Jun 17, 2012

andrewr said:

Yes and no. There is a relationship between them that they both always have the same "radii"; so that if one computes the square of the two "independent" variables and adds them, the correlation with the NON-Gaussian radius becomes part of the result. In most applications this is irrelevant -- but not always. In a true independent generation, the square of the results will not correlate with a non-Gaussian variable in any way. In typical applications, the correlation is only important if the two values are returned sequentially -- so if one simply delays the return randomly, the problem can be removed... but the log calculation is still very slow on an Intel chip.

I disagree on this one. If I understand you correctly you're saying you just found some kind of correlation in the Box-Muller procedure that breaks somehow the independence of the two variables, right? You prove that and you become the new King... Long Live the King

andrewr said:

Surprisingly, (for example) the multiply operation on two Gaussian variables does not produce a true Gaussian -- and the typical error propagation formula found in standard texts fails under certain conditions...

Why would this be surprising?

andrewr said:

I can see why you think that -- but there is an alternative; One can simply have an array of four numbers, and then subtract the first number off the last result and add a new final number randomly generated. The new result only depends on the history -- plus one new random number. Also -- Randomly flipping the sign of the result, or of two of the numbers in the sum can prevent any averaging effect to show up between adjacent results that are returned.

Sign flipping only consumes one random bit .. and extra bits can be saved between generator calls. etc. etc.

Just as a note; Some literature I have seen talks about using an array of "seed" Gaussian values and then doing vector operations on them to produce a derived set.
(SSE or Altivec type programs).

This linear relationships to get the next number are common in algorithms to generate RAND numbers, but we are talking about generating a Gaussian once you have the RAND and going recursive linear again would be an overkill, even assuming it works, which I can't see clearly it would.

Besides, sign flipping is not the issue, the issue would be to know when you have to flip the sign; for that you need another good random bit generator which, in turn, would just add more overhead to the process.

andrewr said:

as for fun: :!) I am... !

Then welcome to PF! I think we need some of that here

haruspex · Jun 17, 2012

Fwiw, I've tried another approach.
The idea is to map the range (0,∞) down to (0,1):
Given target PDF f(x), map x to z as x = g(z), g(0)=0, g(1) = +∞, g monotonic on (0,1).
h(z) = f(g(z)).g'(z)
Thus h(z).dz = f(x).dx.
The procedure is then:
- Pick z, y from uniform distributions on (0,1), (0,K) respectively.
- If h(z) > y return g(z), else pick afresh.
We need the following to be true:
- h and g are speedily computed
- h(z) <= K for all z in (0,1) (so set K = this max)
- Integral of h over (0,1) (= I =∫f(x).dx for x > 0) is a largish fraction of K

E.g. g(z) = 1/(1-z) - 1
h(z) peaks at z = 0.5 with h(0.5) = 4.e^-0.5, approx 2.43.
Integral of f(x) for x > 0 = sqrt(π/2). This makes the 'occupancy' I/K = 0.52. Not great, but perhaps good enough.
I tried it and it certainly does produces half a bell curve. The remaining problem is to flip half of the returned values to negative x.

andrewr · Jun 19, 2012

haruspex,
That looks very interesting -- I'll see if I can code it (If I understand it correctly), and see how it does. Once I have it running -- I'd like to examine the math you did to get the result, too; and compare it against the monty python version I came up with.

As an aside, I did another one of those analytical problems that perform operations on Gaussians -- in this case I did N(0,1) + N(0,1) but with the proviso that *only* values with the same sign were added (100% Cyhelsky skew correlation...); The result was:

[itex]\overline { \sigma }= \sqrt{ 2 + { 4 \over \pi } } \approx 1.809209646[/itex]

And that is correct, by numerical simulation also.

The pearson covariance for the same constant is just:
[itex]{ 2 \over \pi } \approx 0.636619772[/itex]

I am curious, this is a sort of auto-correlation that I haven't found referenced anywhere online -- but I have found it very useful (as an approximate decimal constant up until now -- since I couldn't solve it analytically before!) and there is only one author who seems to be doing work in the same area -- A Czech by the name of Cyhelsky; but his work isn't online except the reference at Wikipedia.

What would this constant be called, eg: I'm thinking it's the auto-correlation constant of Gaussian noise? but is there a systematic name for constants like this one?

No use inventing a name if one already exists...

andrewr · Jun 21, 2012

A picture is worth...

As far as the Monty Python method, it is only slightly faster than the pipe-lined box muller method... Raw uncorrected Test on 100million samples:

MPM = 37 secs
BM = 47 secs

The raw version, without tail correction -- has a small but visible defect in the tail.
Picture #1 (Which messes up calculations...)

When I do a 4 sum correction, as mentioned before -- the speed is slightly slower than the pipelined BM; but the tail errors do go away.
Picture #2

However, even in picutre #2 -- the effects of adjacent samples are still a problem. I would need to shuffle results so as not to output them sequentially.
eg: Neither BM or MPM are optimal solutions..

This time, I hit a snag of a different kind:
What I found is that the conditional statements ("if") and chains of computations are bad combinations resulting in CPU pipeline disruptions... *sigh*

Hence, The only real speedup possible is by improving the random generator; that can definitely make MPM faster than BM, but it isn't worth the time investment as the speedup isn't even 100%.

But I am beginning to think this may be a dead end ; The processes being discussed are highly CPU hardware sensitive...

Please note:
I included your random generator in the computation code/text file, and it benchmarked at 149 seconds...

(Amazing, huh? I would have expected those simple equations to have done better,
but rejection always is very expensive time wise... ;) )

Thank you very much for the example, though, it does indeed produce a bell curve -- (exact) -- and the calculus you used to convert one mapping to another is more useful to me than the code in any event. It helps me understand what can/can't be done to transform equations.

andrewr · Jun 21, 2012

viraltux said:

I doubt very much they are using simple polynomials to approximate the inverse but, luckily, R is a FOSS project which means you can hack the code all you want: http://cran.r-project.org/sources.html

For the inversion method, they have source code called qnorm.c; and FYI in it they have several 5th degree polynomials (and higher!). The code would be exceedingly slow -- but yes, it would be accurate. They are using ratios of polynomials to produce the erf function; and not just erf of the normal(0,1), but of any std deviation and mu.

In R -- the actual random number generation code for rand_norm(); (snorm.c?),
has several methods besides inversion -- Kinderman's method from 1976, and Forsythe's method are included. I already have a later version of Kinderman's code (1977) with Monahan, instead, in my code. It is *much* shorter and just as accurate; but alas -- it's still slow.

ls of the Ziggurat algorithm parallel implementation myself, but I don't think it is a sync issue but rather something related to the inner complexity of the Ziggurat algorithm itself. Maybe you can find more info in the paper's references.

Yes, I see some references to an issue -- but I don't quite understand it; perhaps when I have more time -- I'll get into that in more detail. The ziggaraut clearly has issues and is fairly complicated.

I'm wondering if there isn't a way to use sin() and cos() or perhaps a polynomial and sqrt() since these are far faster than Log, and might allow the BM method to be sped up significantly without much work.
Hmm...

Thanks again for the links, they were informative.

viraltux · Jun 22, 2012

andrewr said:

For the inversion method, they have source code called qnorm.c; and FYI in it they have several 5th degree polynomials (and higher!). The code would be exceedingly slow -- but yes, it would be accurate. They are using ratios of polynomials to produce the erf function; and not just erf of the normal(0,1), but of any std deviation and mu.

Well I said I doubted very much they use just polynomials cause the tails cannot be nicely approximated with them, so it seems R uses for the tails a ratio of polynomials which it is not a polynomial itself.

But I disagree that using polynomials makes the code "exceedingly slow" in fact polynomials are computationally very fast structures to calculate since you have a cost O(n-1) for nth degree polynomials. That means that most of the time the cost to calculate such function in R will be around O(4), which is great.

andrewr said:

I'm wondering if there isn't a way to use sin() and cos() or perhaps a polynomial and sqrt() since these are far faster than Log, and might allow the BM method to be sped up significantly without much work.
Hmm...

Thanks again for the links, they were informative.

There are many performance techniques we can use to speed up algorithms, you can really do wonders if you're determine.

A family of these tricks are based on doing things like alter the algorithm in ways that mathematically makes the algorithm slower but taking into account the machine architecture actually it becomes faster. So for these techniques you have to know your machine.

Another family of techniques are based of breaking the code into pieces and then reuse calculations to avoid redundancy. For instance, if you need to calculate [itex]sin(x)[/itex] and [itex]cos(x)[/itex] (which yo do in BM) you do not want to actually use the functions [itex]sin(x)[/itex] and [itex]cos(x)[/itex], what you do is to break the code for [itex]sin(x)[/itex] and [itex]cos(x)[/itex] and merge it into a [itex]sincos(x)[/itex] function which will be twice as fast and, thus, will make your code much faster.

So it goes for the [itex]log(x)[/itex] function, for BM you have to remember that the [itex]log(x)[/itex] is only applied for numbers within (0,1] that means that using some [itex]log(x)[/itex] implementations is a waste of resources since higher orders of the polynomial adds virtually nothing to the accuracy. So let's say that [itex]log(x)[/itex] uses a polynomial of degree 10, since you know you will never calculate numbers higher than 1, you can create a function [itex]log01(x)[/itex] which only uses a polynomial of degree 5 as machine accurate as the [itex]log(x)[/itex] function but twice as fast.

So, I just explained a couple of tricks that can really speed up your code if you use them instead applying [itex]sin(x)[/itex] , [itex]cos(x)[/itex] and [itex]log(x)[/itex] right away.

And actually the fun goes ooooon... But you're fun, right?

andrewr · Jun 26, 2012

viraltux said:

Well I said I doubted very much they use just polynomials cause the tails cannot be nicely approximated with them, so it seems R uses for the tails a ratio of polynomials which it is not a polynomial itself.

Yes, you are undoubtedly correct. Exactly. I just didn't understand what you meant...

But I disagree that using polynomials makes the code "exceedingly slow" in fact polynomials are computationally very fast structures to calculate since you have a cost O(n-1) for nth degree polynomials. That means that most of the time the cost to calculate such function in R will be around O(4), which is great.

Not really ... In theory, yes -- but in practice, no.
"R" is using code from the 1960's-80's which was for CISC machines.
Cache latencies and piplining issues, were not yet part of the programming milieu;
Foss for example, has several *if* statements in the code deciding when to use which polynomial ... that's exactly what slowed my code for the MPM down horribly compared to my estimate of it's speed. (I'll explain later.)

On most modern machines, the brute force operations of addition and multiplication both have the same time delay, so it's more like 7 operations to do a 4th degree polynomial.

7 operations is about the same time as it takes to compute sin or cos to the full double precision on my machine... But a 4th degree poly is nowhere near able to achieve the same precision.

If you look at the source code I already posted -- you'll notice that I use second and third degree polynomials to avoid computing exp(-0.5*x**2) as much as possible.
I use these polynomials > 98% of the time, compute exp() for the remainder.
I even did alignments for power of two operations where possible, since that doesn't even require a multiply instruction; but only an integer add...

For a particular number of repetitions -- I think it was around a billion -- I have time measurements for each of the math--co operations on my machine.

multiply and divide : 2 seconds (or less)
sin() and cos(): 15 seconds
sqrt(): 36 seconds.
exp(): 46 seconds.
log(): 87 seconds.

In theory, then, I expected my program to 98% of the time to do polynomials at a weight of *less* than 8 seconds. And 2% of the time, to run exp at 46+2+2 ~ 50 seconds;
That works out to <9 seconds average...
Box muller (ignoring multiplies) would be 87+36+15=138 seconds.
(Which isn't far off from reality...)

Hence, the MPM method ought to be on the Order of 15 times fasterthan BM.
But it's nowhere *NEAR* that much faster -- it is, in fact, only *slightly* faster.

If anyone sees a coding mistake, let me know -- but I don't think there is anything wrong with the way I wrote the program.

The main issue for my machine, and these issues are *VERY* common in all desktop machines whether powerPC, or Intel, and even some ARM processors...
Is that failed branch prediction must break a very deep pipeline in the processors.
On an Intel Celeron, the pipeline depth is around 40. That means, every time an "if" statement changes the flow of the machine -- the CPU runs around 40 times slower than normal. A single *if* statement can be very expensive, time wise. A 2.5GHz processor is reduced to 62MHz for a brief moment... There is no way to program preferred branch directions, either, in the C language. It's all up the CPU's automatic discretion... (if it has any bug free ability at all...).

There are many performance techniques we can use to speed up algorithms, you can really do wonders if you're determine.

The main thing I need to do is calculate a single exact formula to produce the Gaussian -- with no *if* statements included in the algorithm. These techniques you mention are all very good (I'm an EE, and although I do hardware -- I'm *very* familiar with these issues.)

So, I just explained a couple of tricks that can really speed up your code if you use them instead applying [itex]sin(x)[/itex] , [itex]cos(x)[/itex] and [itex]log(x)[/itex] right away.

And actually the fun goes ooooon... But you're fun, right?

Yes, I'm into fun. Lots of it.
I really do want to stay away from machine architecture dependent coding.
That means I can't achieve the "highest" performance -- but it also means I can share the code with anyone; I know from hardware design experience that the relative timing of my machine's math co. hold for many other processors as well.
Sin and Cos don't appear to be implementable by any method I know which will be faster using polynomials, than just using the built-in function. Log(), however, could easily be replaced -- and I have some experimental (non-polynomial) formulas that might be well worth curve fitting...

I'm learning a bunch about how a uniform random plugged into a formula produces a pdf ! eg: f( u1 ) --> a pdf of --> d/dx f^-1( x ).
This appears to be key in designing a more easily computed Gaussian generator.

There is much that I am studying in BoxMuller, some of which is just for knowledge -- other parts explorations for speed-up.

One fairly interesting path is to consider sin and cos: Since a function expressed as sin() is at 90 degrees to one expressed with cos(); I was thinking it may be possible to use this relationship to compute an anti-function of something expressed as sin(f(x)).

exp(-0.5*x**2) is slow to compute; but when I graph acos( exp( -0.5*x**2 ) ), The resulting graph appears to be a hyperbola of some kind that would be *VERY* easily computed using only multiplication, division, and addition. Hence, it would be around 2.5 times faster than computing the exponential itself, to compute the cosine of a different function... (And it may be *MUCH* faster than computing the anti-exponential directly as BoxMuller does...!)

Does anyone happen to know what the formula of the hyperbola is for acos( exp(...) ) ?

haruspex · Jun 26, 2012

andrewr said:

exp(-0.5*x**2) is slow to compute; but when I graph acos( exp( -0.5*x**2 ) ), The resulting graph appears to be a hyperbola of some kind

Doesn't look like that to me. There's a right-angle at the origin (gradient -1 to the left, +1 to the right), and an asymptote at y = π/2.
I guess if you look at x > 0 only, it does resemble a hyperbola. Taking the gradient near the origin to represent the other asymptote, the angle between asymptotes on the same branch is 3π/4. So the hyperbola should look like y² = tan²(3π/8)x² - c, rotated clockwise by 3π/8 and suitably translated.

viraltux · Jun 27, 2012

andrewr said:

R" is using code from the 1960's-80's which was for CISC machines. Cache latencies and piplining issues, were not yet part of the programming milieu;...

OK, I admit you are are fun and you caught my curiosity, so I downloaded the code and checked qnorm.c and I found this:

Code:

/*-- use AS 241 --- */
/* double ppnd16_(double *p, long *ifault)*/
/*      ALGORITHM AS241  APPL. STATIST. (1988) VOL. 37, NO. 3

        Produces the normal deviate Z corresponding to a given lower
        tail area of P; Z is accurate to about 1 part in 10**16.

        (original fortran code used PARAMETER(..) for the coefficients
         and provided hash codes for checking them...)
*/
    if (fabs(q) <= .425) {/* 0.075 <= p <= 0.925 */
        r = .180625 - q * q;
	val = q * (((((((r * 2509.0809287301226727 +...

So since the algorithm was published in 1988 at least this code seems it was developed in the early 90s rather than the 60s. Nonetheless, pay attention that 85% of the time you will get your answer back by just checking one if statement.

But anyway, since you also say

andrewr said:

an "if" statement changes the flow of the machine -- the CPU runs around 40 times slower than normal. A single *if* statement can be very expensive, time wise...

The main thing I need to do is calculate a single exact formula to produce the Gaussian -- with no *if* statements included in the algorithm. These techniques you mention are all very good (I'm an EE, and although I do hardware -- I'm *very* familiar with these issues.)

That surprises me because I thought that if statements where suppose to be computationally cheap. OK, it was long long ago in a far far away galaxy when I finished my degree in Computer Science and I had my hands on Assembler delving in the CPU architecture, and since you are an EE and you know your hardware I cannot really discuss deep into it, but I did the following little experiment in my Intel Atom (yeah, I know, but that's what I can afford now) with this code in a high level language (Ruby, my cup of tea join with C)

So I am going to calculate the number 5000000 comparing 1 million additions of 0.5 vs 5000000 additions of 1 within 1 million if statements, these are the results:

Code:

x = 0; 10000000.times{|i| x+=0.5 } #time 5.388s
x = 0; 10000000.times{|i| x+=1 if i%2 == 0} #time 4.236s

So the if statement not only doesn't slow the calculation but actually makes it around 20% faster! So I am not sure why you want to avoid the if statement so badly. Are you sure it has such a dramatic influence in your machine?

andrewr said:

I really do want to stay away from machine architecture dependent coding. That means I can't achieve the "highest" performance -- but it also means I can share the code with anyone; I know from hardware design experience that the relative timing of my machine's math co. hold for many other processors as well. Sin and Cos don't appear to be implementable by any method I know which will be faster using polynomials, than just using the built-in function. Log(), however, could easily be replaced -- and I have some experimental (non-polynomial) formulas that might be well worth curve fitting...

I'm learning a bunch about how a uniform random plugged into a formula produces a pdf! eg: f( u1 ) --> a pdf of --> d/dx f^-1( x ). This appears to be key in designing a more easily computed Gaussian generator. There is much that I am studying in BoxMuller, some of which is just for knowledge -- other parts explorations for speed-up.

One fairly interesting path is to consider sin and cos: Since a function expressed as sin() is at 90 degrees to one expressed with cos(); I was thinking it may be possible to use this relationship to compute an anti-function of something expressed as sin(f(x)).

You have the instruction fsincos in assembler, which actually eliminates redundancies as I explained before, but since I am not sure the instruction is found in every architecture I guess you might not want to use it. But if you are using sin I'd say the cheapest would be using cos as well, any other high level trick seems to me is not going to be faster than that.

andrewr said:

exp(-0.5*x**2) is slow to compute; but when I graph acos( exp( -0.5*x**2 ) ), The resulting graph appears to be a hyperbola of some kind that would be *VERY* easily computed using only multiplication, division, and addition. Hence, it would be around 2.5 times faster than computing the exponential itself, to compute the cosine of a different function... (And it may be *MUCH* faster than computing the anti-exponential directly as BoxMuller does...!)

Does anyone happen to know what the formula of the hyperbola is for acos( exp(...) ) ?

I agree with haruspex it does not look anything like a hyperbola.

andrewr said:

Hence, the MPM method ought to be on the Order of 15 times fasterthan BM. But it's nowhere *NEAR* that much faster -- it is, in fact, only *slightly* faster.

Maybe you want to check this The Monty Python Method for Generating Random Variables from master Marsaglia. You may have a look at his code and try again.

andrewr · Jun 29, 2012

viraltux said:
OK, I admit you are are fun and you caught my curiosity, so I downloaded the code and checked qnorm.c and I found this:
Code:
/*-- use AS 241 --- */
/* double ppnd16_(double *p, long *ifault)*/
/*      ALGORITHM AS241  APPL. STATIST. (1988) VOL. 37, NO. 3

        Produces the normal deviate Z corresponding to a given lower
        tail area of P; Z is accurate to about 1 part in 10**16.

        (original fortran code used PARAMETER(..) for the coefficients
         and provided hash codes for checking them...)
*/
    if (fabs(q) <= .425) {/* 0.075 <= p <= 0.925 */
        r = .180625 - q * q;
	val = q * (((((((r * 2509.0809287301226727 +...
So since the algorithm was published in 1988 at least this code seems it was developed in the early 90s rather than the 60s.

That may have been when the particular piece of code was published but that may not be when it was developed. Fortran code was being phased out by 1985 -- I still learned it as an EE ... either way, much of the code is older (and there is nothing wrong with old code that is well written...).

Nonetheless, pay attention that 85% of the time you will get your answer back by just checking one if statement.

But anyway, since you also say

...

That surprises me because I thought that if statements where suppose to be computationally cheap. OK, it was long long ago in a far far away galaxy when I finished my degree in Computer Science and I had my hands on Assembler delving in the CPU architecture, and since you are an EE and you know your hardware I cannot really discuss deep into it, but I did the following little experiment in my Intel Atom (yeah, I know, but that's what I can afford now) with this code in a high level language (Ruby, my cup of tea join with C)

So I am going to calculate the number 5000000 comparing 1 million additions of 0.5 vs 5000000 additions of 1 within 1 million if statements, these are the results:
Code:
x = 0; 10000000.times{|i| x+=0.5 } #time 5.388s
x = 0; 10000000.times{|i| x+=1 if i%2 == 0} #time 4.236s
So the if statement not only doesn't slow the calculation but actually makes it around 20% faster! So I am not sure why you want to avoid the if statement so badly. Are you sure it has such a dramatic influence in your machine?

I take it ruby is an interpreted language? There might be something in the interface...
I have no idea why doing more work would take less time... ! Your example is absolutely baffling...!

Now, in Ruby, since you have nothing after the if statement -- am I understanding correctly -- the syntax to mean only add 1 when the index is even? eg: do nothing when odd? (I'm not very familiar with Ruby... I have done more PHP do be honest.)

BUT: I'm sure that my code isn't a factor of 10x + faster; which in theory is ought to be... Of course, I didn't account for the random generator's percent of processing time -- I was just expecting a much quicker response than I got!; Hence I clearly made a mistake *somewhere* ; (P.S. my unsure nature is why I asked if anyone saw a mistake in my code)

In any event:
The issue is called "branch prediction". If the processor has hardware smart enough to decide which path is *more* likely, then the penalty only occurs on the occasional time where the prediction is wrong. The rest of the time, the "if" statements are *very* cheap.

Supposedly (I haven't checked the comment made by a friend) Intel processors keep some kind of statistic to predict which is more likely on the next pass. That includes the Atom.

However, even if my code were somehow wrong -- if your example is portable into C as your cup of tea (are YOU sure!) -- then my code ought to have been *much* *much* *much* ******MUUUUCHH********** faster !

You have the instruction fsincos in assembler, which actually eliminates redundancies as I explained before, but since I am not sure the instruction is found in every architecture I guess you might not want to use it. But if you are using sin I'd say the cheapest would be using cos as well, any other high level trick seems to me is not going to be faster than that.

Yes, and I think that shows up on many other architectures -- I need to check the Arm processors, as that is one I'd like to run a calculator on in the future (not quite an Android app... but close).

However, Even if I just got the speedup from using sin() vs. using exp() -- I'd be quite happy!.
It's quick enough...

I agree with haruspex it does not look anything like a hyperbola.

*sigh* Yes ... it doesn't when graphed negative and positive; but haruspex was bright enough to figure out what I meant -- eg: it resembles one when graphed from 0 to positive.

I have come across an almost identical shape before in another project I did some time ago, and it ended up being a ratio of something like (this is going to be wrong... of course) x/(3+x); There may have been square terms in it, etc. I don't remember ... but there was something that caused the shape change at zero.

I really am going to have to come back to that problem once I find my backup of the earlier work -- hopefully that has the answer I am seeking. It will be a good estimate, regardless ...

I do find it irritating I can't just solve the problem analytically; I mean Euler's identity has the exponent in it already: cos( t ) + i*sin( t ) = exp( i*t ) -- so converting from exp to cos() or sin() just shouldn't be that hard. !

I am sure I have seen an exact answer to this problem in a CRC handbook or something. It wasn't a very complicated expression... But, I don't see how to solve it analytically myself.

If anyone does KNOW the exact answer -- please share, please... ;) (Even if it ISNT a hyperbola..!)

Maybe you want to check this The Monty Python Method for Generating Random Variables from master Marsaglia. You may have a look at his code and try again.

Well, I'll certainly benchmark his for speed against mine using the same random generator; and I'll do a plot of the output to see how it compares against the bell curve; but it will be a few days before I have time...
If his is faster and already has the tail solved correctly -- it might just be better to use his code and optimize the random generator. (I have a potential trick for that... !)

But:
I have another issue I'm working out analytically, that is more important to solve first
(next post.)

Thanks so much for the link; that may be good enough.

andrewr · Jun 30, 2012

Multiplications of Gaussians aren't Gaussians...

Ok, I mentioned earlier that the product of Gaussians -- aren't Gaussians.
In this first part of the problem, I am interested in solutions where the Gaussians being multiplied aren't correlated.

This problem comes up quite often in error propagation formulas; and so is interesting to study.

As an example, I attached two such operations; appropriately labeled -- and they both obviously result in non-Gaussians.A typical reference (USA/NIST) indicates that for multiplication statistics one ought to use:

[tex]s_{area }= \sqrt { width^{2} \times s _ {length} ^ {2} + length ^ {2} \times s _ {width} ^ 2 }[/tex]

http://www.itl.nist.gov/div898/handbook/mpc/section5/mpc55.htm

The note says that this particular formula ignores correlation -- which (since there is none) -- is irrelevant to my problem. BUT -- When I try to solve the problem analytically, I get a different formula than shown in NIST.

NIST also list an exact formula, which includes correlation -- and is a solution I don't understand. It appears to have many terms that I wouldn't know exactly what they are... If anyone can show if it reduces to the other formula in absence of correlation, I'd be interested.

I looked around for other listings of error propagation formulas ... and I generally come across this same formula shown in NIST.

eg: In Stower's research for medical applications...
research.stowers-institute.org/efg/Report/PropagationOfErrors.pdf

And, in typical papers I find elsewhere... there is one alternate variation I often see:
http://www.physics.uc.edu/~bortner/...pendix 2/Appendix 2 Error Propagation htm.htm

All they are doing in the alternate form (really) is converting to ratios or percentages, and then doing the same propagation formula as NIST does. However, that variation of the formula also will give a divide by zero for the problem shown in attachment #2... which is actually worse...

So, what is *really* wrong here? :) I *SERIOUSLY* can't find what I did wrong when attempting to derive NIST's formula; and this is just the simple case... ugh!

In order to solve the equation, I am going to assume the normals are in a special form -- with standard deviation always equal to one. It makes for less variables during the calculus, and reduces the odds I will make a mistake... *sigh*.

So, start with two normals N(a,1) and N(b,1), and solve for the product...
In this post, I'm going to solve for the mean; If no one sees a mistake, I'll then solve for the deviation.

Begin:

The probability of each multiplication element is:

[tex]p(X,Y)={ 1 \over 2\pi } ( e^{-{ 1 \over 2}(x-a)^2} e^{-{ 1 \over 2}(y-b)^2} )[/tex]
The elements of the mean for multiplication are:

[tex]\mu_{e} ( X,Y )=xy[/tex]

Which makes the weighted mean's elements:

[tex]p(X,Y) \mu_{e} ( X,Y ) = { 1 \over 2\pi } xy( e^{-{ 1 \over 2}(x-a)^2} e^{-{ 1 \over 2}(y-b)^2} )[/tex]

Combining yields:

[tex]\mu = \int \limits _{ - \infty }^{ \infty } \int \limits_{ - \infty }^{ \infty }{ 1 \over 2 \pi } xy( e^{-{ 1 \over 2 }(x-a)^2} e^{-{ 1 \over 2 }(y-b)^2} ) dxdy[/tex]

Changing to polar coordinates to allow fixed radii vs probability pairs:

[itex]r^{2}=(x-a)^{2}+(y-b)^{2}[/itex] and [itex]x-a=cos( \theta )r[/itex] and [itex]y-b=sin( \theta )r[/itex]

[tex]dxdy=r \times d \theta dr[/tex]

[tex]\mu = \int \limits_{ 0 }^{ \infty } \int \limits _{0}^{ 2 \pi } { 1 \over 2 \pi } (cos( \theta )r+a)(sin( \theta )r+b) e^{-{ 1 \over 2}r^2} (r \times d \theta dr)[/tex]

[tex]\mu = \int \limits_{ 0 }^{ \infty } \int \limits _{0}^{ 2 \pi } { 1 \over 2 \pi } (sin( \theta )cos( \theta )r^{2} + cos(\theta)br + sin( \theta )ar + ab) e^{-{ 1 \over 2}r^2} d \theta dr[/tex]

All of the integrated trigonometric terms will have the same value at 0 and [itex]2 \pi[/itex] and may be safely ignored.

[tex]\mu = { 1 \over 2 \pi } \int \limits_{ 0 }^{ \infty } \left . (ab \theta) \right | _{0}^{ 2 \pi } re^{-{ 1 \over 2}r^2} \times dr[/tex]

[tex]\mu = ab \int \limits_{ 0 }^{ \infty } r e^{-{ 1 \over 2}r^2} \times dr[/tex]

[tex]\mu = ab[/tex]

So, when multiplying N(5,2) * N (3,7); it will be the same as 2*7*(N(5/2,1),N(3/7,1)
And that would yield: 5/2*3/7 = 15/14 * 2*7 = 15. In general, then, the mean is always the product of the original means for uncorrelated Gaussians.

That, everyone seems to agree to ... but did I mess up or is this, "so far, so good?"

viraltux · Jun 30, 2012

andrewr said:

I take it ruby is an interpreted language? There might be something in the interface...
I have no idea why doing more work would take less time... ! Your example is absolutely baffling...!
Now, in Ruby, since you have nothing after the if statement -- am I understanding correctly -- the syntax to mean only add 1 when the index is even? eg: do nothing when odd? (I'm not very familiar with Ruby... I have done more PHP do be honest.)

BUT: I'm sure that my code isn't a factor of 10x + faster; which in theory is ought to be...

However, even if my code were somehow wrong -- if your example is portable into C as your cup of tea (are YOU sure!) -- then my code ought to have been *much* *much* *much* ******MUUUUCHH********** faster !

Ported to C

Code:

void main(int argc, char *argv[]) {
   double x = 0;
   for(int i=0; i<100000000; i++){ x += 0.5; } /* time  1.244s */
   for(int i=0; i<100000000; i++){ if (i%2==0) {x += 1;}} /* time 0.904s */
   printf("x is %f\n", x);
}

Now I am sure... The if statement, you got to love it

I do find it irritating I can't just solve the problem analytically; I mean Euler's identity has the exponent in it already: cos( t ) + i*sin( t ) = exp( i*t ) -- so converting from exp to cos() or sin() just shouldn't be that hard. !

Well, it's not, just use the expi (for imaginary numbers) function if you don't want to use fsincos. But if you're interested in the real part I'd say the sin, cos is not going to help you.

I am sure I have seen an exact answer to this problem in a CRC handbook or something. It wasn't a very complicated expression... But, I don't see how to solve it analytically myself.

If anyone does KNOW the exact answer -- please share, please... ;) (Even if it ISNT a hyperbola..!)

The point is that you can already approximate efficiently exp with ratio of polynomials, and you're looking for a ratio of polynomials that calculates asin(exp) then applying sin again... does this makes sense?

If his is faster and already has the tail solved correctly -- it might just be better to use his code and optimize the random generator. (I have a potential trick for that... !)

Bring it on!

viraltux · Jun 30, 2012

andrewr said:

Ok, I mentioned earlier that the product of Gaussians -- aren't Gaussians.
In this first part of the problem, I am interested in solutions where the Gaussians being multiplied aren't correlated.

This problem comes up quite often in error propagation formulas; and so is interesting to study.

As an example, I attached two such operations; appropriately labeled -- and they both obviously result in non-Gaussians.A typical reference (USA/NIST) indicates that for multiplication statistics one ought to use:

[tex]s_{area }= \sqrt { width^{2} \times s _ {length} ^ {2} + length ^ {2} \times s _ {width} ^ 2 }[/tex]

http://www.itl.nist.gov/div898/handbook/mpc/section5/mpc55.htm

The note says that this particular formula ignores correlation -- which (since there is none) -- is irrelevant to my problem. BUT -- When I try to solve the problem analytically, I get a different formula than shown in NIST.

NIST also list an exact formula, which includes correlation -- and is a solution I don't understand. It appears to have many terms that I wouldn't know exactly what they are... If anyone can show if it reduces to the other formula in absence of correlation, I'd be interested.

I think the NIST article is a bit misleading, the general formula is not for the variance of a product of two variables, but for the square of the standard error for the product of the means of two samples. Pay attention to the n; when it goes to infinity the standard error goes to zero. So this formula does not measure the variance of a product as it claims.

You can learn to derive the general formula for the product of to variables with this paper:

http://www.jstor.org/discover/10.2307/2286081?uid=3737952&uid=2&uid=4&sid=47699110527507

In the general expression derived in this paper you can see that when X and Y are not correlated many terms disappear and the remaining expression is the one you post here: [itex]s_{area }= \sqrt { width^{2} \times s _ {length} ^ {2} + length ^ {2} \times s _ {width} ^ 2 }[/itex]

andrewr · Jun 30, 2012

Now I am sure... The if statement, you got to love it

Yes,

I do understand why it's faster now; I didn't notice before that the predicate of the if statement was an integer. So, you are doing one integer operation or one floating point operation; although modulus is often two operations (division, multiplication, subtraction) -- I think that integer division has a remainder in assembly...

If you switch the variable i, to a float or double, then your code has the same characteristics as mine does. The predicate of the float is a float...

But just getting a rough estimate of the speed of the *if* with your loop (ignoring the for loop itself, as that is not a true if statement but a repeat statement in assembly -- perfect branch prediction) I am pretty sure my earlier estimate of the pipeline delay breaking (which is based on intel's own data, not measurement) doesn't explain why my code is so much slower than expected...

Hmmm...

Well, it's not, just use the expi (for imaginary numbers) function if you don't want to use fsincos. But if you're interested in the real part I'd say the sin, cos is not going to help you.

I'll have to come back to this... hold that thought for a while...
Yes, your ideas *do* make sense.

andrewr · Jun 30, 2012

viraltux said:

I think the NIST article is a bit misleading, the general formula is not for the variance of a product of two variables, but for the square of the standard error for the product of the means of two samples. Pay attention to the n; when it goes to infinity the standard error goes to zero. So this formula does not measure the variance of a product as it claims.

You can learn to derive the general formula for the product of to variables with this paper:

http://www.jstor.org/discover/10.2307/2286081?uid=3737952&uid=2&uid=4&sid=47699110527507

In the general expression derived in this paper you can see that when X and Y are not correlated many terms disappear and the remaining expression is the one you post here: [itex]s_{area }= \sqrt { width^{2} \times s _ {length} ^ {2} + length ^ {2} \times s _ {width} ^ 2 }[/itex]

OK, and when I scan through the paper -- the initial steps do verify that he takes the mean to always be the simple product of means as well...And I see the next step for computing a variance is also the same as mine... but then he uses simple algebra; and although I used calculus -- I should have gotten the same answer -- and I just double checked, and I still can't find my mistake.

To make it worse, my answer actually agrees more with my posted experiments ... which could just be be a fluke...

I wonder, is there some kind of assumption in Goodman that I missed?
This is my path to it:

The variances for multiplication are:

[tex]\sigma ^{2} ( X,Y )=( xy - ab )^{2}[/tex]

And the weighted variances:

[tex]p(X,Y) \sigma^{2} (X,Y) = { 1 \over 2\pi }( xy - ab )^{2} ( e^{-{ 1 \over 2}(x-a)^2} e^{-{ 1 \over 2}(y-b)^2} )[/tex]

[tex]\sigma ^{2} = \int \limits _{ - \infty }^{ \infty } \int \limits_{ - \infty }^{ \infty }{ 1 \over 2 \pi } (xy-ab)^2( e^{-{ 1 \over 2 }(x-a)^2} e^{-{ 1 \over 2 }(y-b)^2} ) dxdy[/tex]

Changing to polar coordinates to allow fixed radii vs probability pairs:

[itex]r^{2}=(x-a)^{2}+(y-b)^{2}[/itex] and [itex]x-a=cos( \theta )r[/itex] and [itex]y-b=sin( \theta )r[/itex] and [itex]dxdy=r \times d \theta dr[/itex]

[tex]\sigma ^{2} = { 1 \over 2 \pi } \int \limits_{ 0 }^{ \infty } \int \limits _{0}^{ 2 \pi } ((cos( \theta )r+a)( sin( \theta )r+b) - ab)^{2} e^{-{ 1 \over 2}r^2} (r \times d \theta dr)[/tex]

Expanding the xy substitution inside the parenthesis, factoring common terms...

[tex]\sigma^{2} = { 1 \over 2 \pi } \int \limits_{ 0 }^{ \infty } \int \limits _{0}^{ 2 \pi } (sin( \theta )cos( \theta )r^{2} + cos(\theta)br + sin( \theta )ar + ab -ab)^2 re^{-{ 1 \over 2}r^2} d \theta dr[/tex]

[tex]\sigma^{2} = { 1 \over 2 \pi } \int \limits_{ 0 }^{ \infty } \int \limits _{0}^{ 2 \pi } ( sin( \theta )cos( \theta )r + ( cos(\theta)b + sin( \theta )a ) )^2 r^3e^{-{ 1 \over 2}r^2} d \theta dr[/tex]

[tex]\sigma^{2} = { 1 \over 2 \pi } \int \limits_{ 0 }^{ \infty } \int \limits _{0}^{ 2 \pi } g( r, \theta ) r^3e^{-{ 1 \over 2}r^2} d \theta dr[/tex]

Expanding the trigonometric terms' square.

[tex]g( r, \theta ) = sin^2( \theta )cos^2( \theta )r^2 + 2sin( \theta )cos^2( \theta )br + 2sin^2( \theta )cos( \theta )ar + (cos^{2}( \theta )b^{2} + 2cos( \theta )sin( \theta )ab + sin^{2}( \theta )a^{2})[/tex]Integrating the expansion with respect to [itex]\theta[/itex], r being constant.

[tex]\int \limits _{ 0 }^{ 2 \pi }g( \theta, r )d \theta =[/tex]

[tex]\left . \left ( {(4 \theta - sin(4 \theta)) \over 32}r^{2} - { 2 \over 3 } cos^3( \theta )br + {2 \over 3} sin^{3}( \theta )ar + { \theta + sin( \theta )cos( \theta ) \over 2 }b^2 - {cos( 2 \theta ) \over 2}ab+{ \theta - sin( \theta )cos( \theta ) \over 2} a^2 \right ) \right | _{0}^{2 \pi }[/tex]

Any trigonometric term will have the same value at 0 and [itex]2 \pi[/itex] and cancel, leaving.

[tex]\int \limits _{ 0 }^{ 2 \pi }g( \theta )d \theta = \left . \left ( { \theta \over 8 }r^{2} + { \theta \over 2 }b^2 +{ \theta \over 2} a^2 \right ) \right | _{0}^{2 \pi } = 2 \pi \left ( {r^{2} \over 8} + { a^{2} \over 2 } + { b^{2} \over 2 } \right )[/tex]

Now, Back-substituting the result...

[tex]\sigma^{2} = { 1 \over 2 \pi } \int \limits_{ 0 }^{ \infty } 2 \pi \left ( {r^{2} \over 8} + { a^{2} \over 2 } + { b^{2} \over 2 } \right ) r^3e^{-{ 1 \over 2}r^2} dr[/tex]

[tex]\sigma^{2} = { 1 \over 8 } \int \limits_{ 0 }^{ \infty } r^5e^{-{ 1 \over 2}r^2} dr + { a^{2} + b^{2} \over 2 } \int \limits_{ 0 }^{ \infty } r^3e^{-{ 1 \over 2}r^2} dr[/tex]

[tex]\sigma^{2} = { 1 \over 8 } \left . \left ( -e^{-{ x^2 \over 2 }}(x^4 + 4x^2 + 8 ) \right ) \right | _{ 0 }^{ \infty } + { a^{2} + b^{2} \over 2 } \left . \left ( -e^{-{ 1 \over 2 }x^{2} }( x^2 + 2 ) \right ) \right | _{ 0 }^{ \infty }[/tex]

[tex]\sigma^{2} = { 1 \over 8 } \left ( 0 - -(1)(0+0+8) \right ) + { a^{2} + b^{2} \over 2 }( (0) - (-2) )[/tex]

[tex]\sigma^{2} = 1 + \left ( { a^2 + b^2 } \right )[/tex]Taking the final result, and extending it to values with arbitrary stds;

[tex]N( a, \sigma _ {a} ) \times N( b, \sigma _ {b} ) = \sigma _ {a}\sigma _ {b} \times N( { a \over \sigma _ {a} }, 1 ) \times N( { b \over \sigma _ {b} }, 1 )[/tex]

[tex]N( a, \sigma _ {a} ) \times N( b, \sigma _ {b} ) \rightarrow \mu = ab , \sigma = \sqrt { ( \sigma _ {a} b )^{2} + ( \sigma _ {a}\sigma _ {b} )^{2} + ( a \sigma _ {b} )^{2} }[/tex]

So, my formula only differs in one term from that posted on NIST and all over the internet...

SO HERE'S Andrew's megalomania version...

It's me against the world (and likely to loose...! again...)

I get,
[tex]\sigma _ {width \times length} = \sqrt{ width ^ {2} \times \sigma _ {length} ^ {2} + \sigma _ {length} ^ {2} \times \sigma _ {width} ^ {2} + length ^ {2} \times \sigma _ {width} ^ {2} }[/tex]

Oy!, and so how did I blow it this time? (I thought I was getting better at the calculus up and until now...)

viraltux · Jun 30, 2012

andrewr said:

Yes,
I do understand why it's faster now; I didn't notice before that the predicate of the if statement was an integer. So, you are doing one integer operation or one floating point operation; although modulus is often two operations (division, multiplication, subtraction) -- I think that integer division has a remainder in assembly...

If you switch the variable i, to a float or double, then your code has the same characteristics as mine does. The predicate of the float is a float...

Well, but obviously it depends on the condition, the more calculations in the condition the slower is going to get! My point is that the poor old if statement by itself is not to blame.

andrewr said:

But just getting a rough estimate of the speed of the *if* with your loop (ignoring the for loop itself, as that is not a true if statement but a repeat statement in assembly -- perfect branch prediction) I am pretty sure my earlier estimate of the pipeline delay breaking (which is based on intel's own data, not measurement) doesn't explain why my code is so much slower than expected...

But anyway, one thing you can try is to change whatever conditions you have for something as simple and predictable as 1==1 and, this way, you can measure how much your conditions slow the program. At least you'll know how much blame to can place on that.

viraltux · Jun 30, 2012

andrewr said:

So, my formula only differs in one term from that posted on NIST and all over the internet...

SO HERE'S Andrew's megalomania version... It's me against the world (and likely to loose...! again...)

I get,
[tex]\sigma _ {width \times length} = \sqrt{ width ^ {2} \times \sigma _ {length} ^ {2} + \sigma _ {length} ^ {2} \times \sigma _ {width} ^ {2} + length ^ {2} \times \sigma _ {width} ^ {2} }[/tex]

Oy!, and so how did I blow it this time? (I thought I was getting better at the calculus up and until now...)

You're right! The Internet is wrong!

Well, after NIST getting wrong the second formula I should not have trusted the first either! And now NIST goes out of my bookmarks and, yesss, there it goesssss.

So I went back to check Goodman's formula cause I could not see anything wrong in your development, and there it was, the extra term that you've just calculated is there in Goodman's formula! to be precise is [itex]E(Δx^2Δy^2)[/itex] which happens not to be zero when X,Y are independent unlike the rest of the terms that go away in that situation!

Oh my God... if you cannot trust random guys in the Internet who're you going to trust now.

andrewr · Jun 30, 2012

viraltux said:

Well, but obviously it depends on the condition, the more calculations in the condition the slower is going to get! My point is that the poor old if statement by itself is not to blame.

You may be right; but my point was that the if statement is the only thing left *unless* there is a mistake in my code someplace;

perhaps I ought to have counted the instructions in detail before making the comment...

In reality, I ought to re-measure *everything* -- and re-test; but I am hesitant to do so -- as fixing the random generator needs to be done first to avoid wasting a lot of time -- and that's why I haven't yet.

But anyway, one thing you can try is to change whatever conditions you have for something as simple and predictable as 1==1 and, this way, you can measure how much your conditions slow the program. At least you'll know how much blame to can place on that.

Oh, how much bliss would be my ignorance...

No, I really can't -- if I set it to 1==1, the gcc compiler will remove the "if" altogether -- and since it takes the "if" to determine the percentage of execution of each path -- such a change must at least include a calculation that allows me to estimate the performance hit... I can only remove fixed computations, such as that for exp().

There are, inside intel's processor, several hardware optimizations that overlap;
there is out-of-order execution of instructions, pipeline-delay, cache-delay, and math co-processor de-coupling. If I had made the hardware myself, it would be easy to weigh which is causing what -- but even professional software engineers tell me that the published specifications do not always agree with performance. eg: some were attempting to make an optimized movie player for Intel -- only to find that the optimized version ran slower than the un-optimized version because the processor did not optimize in the same way the intel engineers said it would.

I would like to think that *none* of these are causing the problem -- however, I have no alternate explanation at this time and it doesn't look to be simple. (But I wouldn't be surprised if it was just a dumb mistake of mine or a transcription error made much earlier...)

(P.S. Considering we have optimizations uncertainties in the loop -- *MULTIPLIED* by other uncertainties -- what formula ought I use to calculate the range (product) of the uncertainties?

)

Here's the code I wrote, it's short enough to inspect right here:

PHP:

double MontyPythonGaussianCore(void) {
// A rapid partitioning of the bell curve into optimized computation areas 
// that can be additively re-maped for fast generation of a normal distribution 
// The program as shown does not have tail correction...
// And hasn't been carefully debugged; test it before using it...
//
// See:
// https://www.physicsforums.com/attachment.php?attachmentid=48387&d=1339819476
// https://www.physicsforums.com/showthread.php?t=611422
//
static const double spin  =0.966804869573191;
static const double spin2 =0.966804869573191*2.0; // insure compute once... 
static const double iTop  =1.59576912160573;      // 1.0 / sqrt(pi/8)

	double z = Random01()*2.;
	double u1, lowerN, upperN, ebar;

	if (z<=spin) return z; // Area region A1 always returns...

	if (z>spin2) { // Tail region computation for sure... GO recursive!
		return  spin2 + MontyPythonGaussianCore()*0.609743713386175;
		// A correction ought to go here, but I'm ignoring it for now.
	}

	// Otherwise, it must be in regions A2,A3,error.
	// And requires a "y" point.
	u1 = Random01();

	// Compute an estimated bound -- and if too close -- compute exactly
	ebar   = 0.008; // maximum polynomial error is 0.8% (<0.5% avg).
	lowerN =((0.27*z)-1.57009)*z+2.26458;

	// The following if operation could be re-ordered with the return's
	// depending on which is faster, a caluculation -- or a cache miss...
	if (u1      <= lowerN) return z;
	if (u1-ebar <= lowerN) {
		lowerN = exp(-0.5*z*z) * iTop; // Compute the exact value.
		if (u1<=lowerN) return z; // barely below/on the curve of A2.
	 }

	 // At this point, the data *must* be above curve a2. 
	 // curves e0, and a3 are mirrored...
	 z = spin2 - z;

	 ebar=0.00246; // Maximum polynomial error is 0.246% (<~0.15% avg.)
	 upperN  =  (( -0.2922*z + 0.9374)*z - 0.0171)*z + 0.40485;
	 
	 if (u1        > upperN) return z;
	 if (u1 + ebar > upperN) {
		upperN = 2.0 - exp( -0.5*z*z ) * iTop;
		if (u1 >= upperN) return z;
	 }

	 // There is no other possibility, the value is an e0
	 // Return it as a tail region data point...
	 
	 return spin2 + z*1.02803051369197;
}

The first "*if*" statement returns ~48% of the time; A true multiply is not needed since a multiply by 2.0 is the same as an exponent adjust of +1; So -- This code *OUGHT* to be reaaaally fast. I'd guestimate -- 1% of BM time or better...

So, in order for the whole algorithm to be only slightly faster than BM (say 80% of BM's time as a random figure...), we would need 1% * 0.48 + x%*0.52 = 80%;

So, that's about x% = 152% ; eg: I would have to be slower than BM by 50% on average... ?!

Considering that BM uses the log operation at 80+ seconds (not counting the rest of the algorithm)/billion ops. And I am only using very short polynomials at less than 2s/add or multiply -- I find it surprising that my code is even on the same order of magnitude...

Scan through the code, and do an estimate of what you think the execution time ought to be on average... or better; could you isolate the specific thing in the code which you suspect is causing the problem -- and I'll replace that with some kind of dummy statement and see what happens ?

Cheers...

Expectation value for CDF fails

Discussion Overview

Discussion Character

Main Points Raised

Areas of Agreement / Disagreement

Contextual Notes

Who May Find This Useful

Attachments

Attachments

Attachments

Attachments

Similar threads

Undergrad My basic understanding of set theory

Undergrad The problem of points

Graduate Expected numbers of cards of a last color remaining

Undergrad How does axiom of foundation prevent infinite sequence of elements?

Undergrad Understanding permutations and combinations in a coin toss experiment

Insights Revisiting the Velocity-Time Function

Insights Remote Operated Gate Control System

Insights AI Enriched Problem Solving

Insights Thinking Outside The Box Versus Knowing What’s In The Box

Insights Why Entangled Photon-Polarization Qubits Violate Bell’s Inequality

Insights Quantum Entanglement is a Kinematic Fact, not a Dynamical Effect