Sum of All Positive Integers n where d+n/d is Prime to 100M

  • Thread starter: Arman777
  • Tags: Euler
Summary
The discussion focuses on solving Project Euler Problem 357, which involves finding the sum of all positive integers n not exceeding 100 million such that for every divisor d of n, the expression d + n/d is prime. Participants share code snippets for generating divisors and testing for primality, noting that the current implementation is slow for larger numbers. Suggestions include using the Sieve of Eratosthenes to generate a set of primes for faster membership testing, as set lookups are significantly quicker than list lookups. Profiling the code is emphasized to identify bottlenecks and optimize performance. The conversation highlights the importance of efficient algorithms in computational problems.
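For orientation, here is a minimal sketch (illustrative only; the names and the small LIMIT below are not from the thread) of the brute-force approach plus the sieve-as-a-set suggestion: precompute the primes once, keep them in a set so membership tests are O(1), and check d + n/d over each divisor pair.

Python:
LIMIT = 1000  # the real problem uses 10**8; kept small here for a quick demo

def prime_set(limit):
    """Sieve of Eratosthenes returning a set (set lookups beat list lookups)."""
    flags = bytearray([1]) * (limit + 1)
    flags[0:2] = b"\x00\x00"
    for p in range(2, int(limit ** 0.5) + 1):
        if flags[p]:
            flags[p * p::p] = bytearray(len(range(p * p, limit + 1, p)))
    return {i for i, f in enumerate(flags) if f}

primes = prime_set(LIMIT + 1)  # d = 1 gives n + 1, so sieve one past the limit

def qualifies(n):
    """True if d + n/d is prime for every divisor d of n (each pair shares one sum)."""
    return all(d + n // d in primes
               for d in range(1, int(n ** 0.5) + 1) if n % d == 0)

print(sum(n for n in range(1, LIMIT + 1) if qualifies(n)))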
  • #61
PeterDonis said:
This is very disappointing and confusing to me, as while Python is expected to be slow, it's not supposed to be that slow, nor is it supposed to invert one's usual intuition about things like, as you put it, factorization is slow but multiplication is fast.
That bugs me too. Since yesterday I have been trying to use Numba and PyPy, but I couldn't find a way to run them. Maybe you can find a way to speed things up using Numba, PyPy, or Cython.
 
Last edited:
  • #62
PeterDonis said:
This is very disappointing and confusing to me, as while Python is expected to be slow, it's not supposed to be that slow, nor is it supposed to invert one's usual intuition about things like, as you put it, factorization is slow but multiplication is fast.

Can you try another implementation of Python? I agree with you that this doesn't look right.

I suspect some inefficiencies in memory access. My reasoning is that running my code on machines with a faster CPU but slower memory slows it down. So if the 100 MB arrays weren't already enough of a tip-off that memory access would be important, the measurements confirm it.
 
  • #63
Vanadium 50 said:
Can you try another implementation of Python?

I don't have any others handy, and anyway I personally am more interested in figuring out what's going on with the standard CPython implementation.

Vanadium 50 said:
I suspect some inefficiencies in memory access.

That's a possibility, yes, but if that's what's going on, I don't see how to fix it, since I'm not doing anything unusual with the Python code.

For example, here is the function that corresponds to your n = 4 case as far as generating the combinations. It only yields 3-tuples because it only iterates over the primes from 3 on (that's what the primes parameter is: a list of those primes); the function that calls this one takes care of multiplying by 2 and testing each particular combination to see if it belongs in the result set. This function, and its counterparts for the other combination sizes, is where my code spends most of its time. (The q parameter is n divided by 2, so it will be 50,000,000 for the n = 10^8 case. The prod function just multiplies together all the items in an iterable and returns the product.)

Python:
from itertools import takewhile
from math import prod  # or an equivalent helper that multiplies all items in an iterable

def _factor_gen_3(primes, q, debug=True):
    # Yield 3-tuples (a, b, c) of distinct odd primes with a * b * c < q.
    for i, a in enumerate(primes[:-2]):
        # Stop once even the two smallest remaining primes push the product past q.
        if a * prod(primes[i + 1:i + 3]) >= q:
            break
        for j, b in enumerate(primes[i + 1:-1]):
            if a * b * prod(primes[i + j + 2:i + j + 3]) >= q:
                break
            for c in takewhile(lambda p: a * b * p < q, primes[i + j + 2:]):
                yield (a, b, c)
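For context, a caller might look roughly like the sketch below. This is not the actual driver from the thread; prod is assumed to be math.prod, primes_set is an assumed set of the sieved primes for fast membership tests, and the divisor check is written naively just to show where the "multiply by 2 and test" step fits.

Python:
from math import prod

def _results_3(primes, primes_set, n):
    """Hypothetical driver for _factor_gen_3 (illustration only)."""
    q = n // 2
    for combo in _factor_gen_3(primes, q):
        cand = 2 * prod(combo)  # reattach the factor of 2 the generator leaves out
        # Naive membership test: every divisor pair (d, cand // d) must give a prime sum.
        if all(d + cand // d in primes_set
               for d in range(1, int(cand ** 0.5) + 1) if cand % d == 0):
            yield cand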
 
  • #64
I don't see why that should be slow. The lambda is a bit unusual, but there's no reason it should be slower than an ordinary function.

Maybe a clue would be to try other values of n (it's 100,000,000 in the question as asked) and try to infer the computational complexity. Is the slowness ##O(n)##, ##O(n^2)##, or something else?
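One way to check (a rough timing sketch, assuming the _factor_gen_3 generator from the earlier post; sieve_list here is just a stand-in for however the prime list is actually precomputed) is to time the combination generation at several values of n and watch how the ratios grow:

Python:
import time

def sieve_list(limit):
    """Return the odd primes up to limit (a simple stand-in for the precomputed list)."""
    flags = bytearray([1]) * (limit + 1)
    flags[0:2] = b"\x00\x00"
    for p in range(2, int(limit ** 0.5) + 1):
        if flags[p]:
            flags[p * p::p] = bytearray(len(range(p * p, limit + 1, p)))
    return [i for i in range(3, limit + 1) if flags[i]]

# _factor_gen_3 is the generator from post #63.
for n in (10**6, 10**7, 10**8):
    primes = sieve_list(n)
    start = time.perf_counter()
    count = sum(1 for _ in _factor_gen_3(primes, n // 2))
    elapsed = time.perf_counter() - start
    print(f"n = {n:>11,}: {count:,} tuples in {elapsed:.2f} s")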
 
  • #65
Vanadium 50 said:
The lambda is a bit unusual

It's just easier than having to define a separate function for something so simple that only appears as a predicate in takewhile. As you say, it shouldn't make any difference to speed.

Vanadium 50 said:
Is the slowness ##O(n)##, ##O(n^2)## or something else?

Roughly quadratic from the comparisons I've been able to make.
 
  • #66
PeterDonis said:
Roughly quadratic from the comparisons I've been able to make.

Well, the sieve itself is ##O(n \log \log n)##. So if you are seeing quadratic behavior, it's not the algorithm: it's that executing the Python internals is taking longer than it should. Again, I would guess memory, just because it's easy to imagine how the second memory allocation might take more than the first.
 
  • #67
Vanadium 50 said:
the sieve itself is ##O(n \log \log n)##. So if you are seeing quadratic behavior, it's not the algorithm

I'm not counting the time it takes to pre-compute the list of primes using the sieve; I'm just talking about the time it takes to generate the combinations of prime factors once the list of primes is pre-computed.

Since the combination generator functions use nested for loops, we would expect them to be quadratic or worse if it weren't for the loop truncations. (After all, a straight count of ##r##-element combinations taken from ##N## elements total goes like ##N^r## for large ##N##.) But the loop truncations should limit the number of combinations to roughly linear in ##N##; at least I think so, since basically we're just generating all squarefree numbers less than ##N##, and the count of those is roughly linear in ##N##.
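As a quick sanity check on that claim (a throwaway sketch, not part of either solution), one can count the squarefree numbers up to ##N## directly and compare against the asymptotic density ##6/\pi^2 \approx 0.608##:

Python:
from math import pi

def count_squarefree(limit):
    # Sieve out every multiple of p*p; whatever survives is squarefree.
    flags = bytearray([1]) * (limit + 1)
    p = 2
    while p * p <= limit:
        flags[p * p::p * p] = bytearray(len(range(p * p, limit + 1, p * p)))
        p += 1
    return sum(flags[1:])  # count indices 1..limit

N = 10**6
print(count_squarefree(N), round(6 / pi**2 * N))  # the two counts agree to within a few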
 
  • #68
PeterDonis said:
loop truncations should limit the number of combinations to roughly linear in N

Right. ##\frac{6}{\pi^2}N## if you want to be precise. But fundamentally it's got to be ##O(N)##, since you are testing (most of) the numbers, just in a different order.

Same point as above, though. The algorithm can't be doing this to you. It's got to be something internal to your Python implementation.
 
  • #69
PeterDonis said:
Roughly quadratic from the comparisons I've been able to make.

After further testing, it looks closer to ##N \log N##.

Vanadium 50 said:
The algorithm can't be doing this to you. It's got to be something internal to your Python implementation.

I'm afraid so, yes.
 
  • #70
If it's ##N \log N##, I wonder if it allocates memory in non-contiguous blocks: perhaps a B-tree. The ##N \log N## would come from traversing the tree rather than just incrementing a counter.

Python does garbage collection automatically, right? If so, maybe it's a side effect of keeping track of objects that gets overwhelmed by these giant arrays. That would also explain why I don't see it; I made C++ give me contiguous arrays.
 
  • #71
Vanadium 50 said:
I wonder if it allocates memory in non-contiguous blocks

It could, although I'm not sure how that would affect the performance of simple nested for loops over slices of a list that's not being mutated. All of the variable accesses at the C level inside the interpreter should be just pointers.

Vanadium 50 said:
Python does garbage collection automatically, right?

Yes. But disabling automatic gc doesn't affect the execution time, so this doesn't seem to be the issue. (Mostly Python deallocates objects based on their reference count going to 0; the automatic gc is for cases that can't be handled that simply, like circular references. I don't think there are any of those in my code.)
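For reference, that gc check can be done with the standard gc module; here is a minimal sketch, where work() is just a stand-in for whatever is actually being measured (e.g. exhausting the combination generator):

Python:
import gc
import time

def work():
    # Stand-in for the real measurement target; replace with the call being profiled.
    return sum(i * i for i in range(10**6))

def timed(fn):
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start

gc.disable()
try:
    no_gc = timed(work)
finally:
    gc.enable()
with_gc = timed(work)
print(f"gc disabled: {no_gc:.3f} s, gc enabled: {with_gc:.3f} s")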
 
  • #72
Vanadium 50 said:
perhaps a B-tree. The ##N \log N## would come from traversing the tree

If you mean the interpreter has to do this when allocating small blocks of memory for new objects, yes, this is one of my prime suspects at this point, since one thing the for loops can't avoid doing is allocating new objects on every iteration to do the tests and yield the combinations.
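A crude way to get a feel for that per-iteration allocation cost (a standalone micro-benchmark, unrelated to the solver itself) is to time a loop that allocates a fresh tuple on every pass against one that does not:

Python:
import timeit

def with_allocation(n=10**6):
    for i in range(n):
        t = (i, i + 1, i + 2)  # a fresh tuple (plus two new ints) every iteration

def without_allocation(n=10**6):
    for i in range(n):
        t = i  # no extra allocation beyond the loop itself

print("tuples:   ", timeit.timeit(with_allocation, number=10))
print("no tuples:", timeit.timeit(without_allocation, number=10))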
 
  • #73
I am actually thinking that if it has very small memory blocks, it needs to figure out where everything is even when just reading. But what I can't figure out is why anyone would do this on a modern CPU.
 
  • #74
Vanadium 50 said:
I am actually thinking that if it has very small memory blocks, it needs to figure out where everything is even when just reading.

Python objects at the C level are just pointers to data structures. There are variable name lookups, which in the general case are hash table lookups, but for local variables inside functions those are optimized to array accesses.
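That difference is easy to see with the standard dis module: a local variable compiles to LOAD_FAST (an index into a C array of the frame's locals), while a global compiles to LOAD_GLOBAL (a dict lookup). A small illustration:

Python:
import dis

GLOBAL_X = 1

def local_access():
    x = 1
    return x         # compiles to LOAD_FAST: indexed slot in the frame

def global_access():
    return GLOBAL_X  # compiles to LOAD_GLOBAL: name lookup in the globals dict

dis.dis(local_access)
dis.dis(global_access)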
 
  • #75
OK, I give up. But since the behavior is crazy, any explanation of it is bound to be a little crazy too.
 
  • #76
Some final thoughts:

Can I do better than one second? Maybe a little. As I said, memory access is the bottleneck. A bool is a byte, so I need 100 MB to store my prime array; with clever packing, I can get a factor of at least 16. But how much can I speed up? Well, the prime-finding is 800 ms, and about half of that is memory access, and maybe I can double the speed of that, so there's 200 ms to be gained. About 20%. There are slightly faster sieves, so maybe I can get a 30% savings, but probably not much more.
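In Python terms, the packing might look something like the sketch below: one bit per odd number instead of one byte per number, which is where a factor of 16 comes from. (This is only an illustration of the storage scheme, not the actual C++ implementation being described.)

Python:
def packed_odd_sieve(limit):
    """Bit-packed sieve: bit i of the array represents the odd number 2*i + 1."""
    nbits = (limit + 1) // 2
    bits = bytearray(b"\xff") * ((nbits + 7) // 8)  # 1 = "assume prime"
    bits[0] &= ~1  # 1 is not prime
    for i in range(1, (int(limit ** 0.5) + 1) // 2 + 1):
        if bits[i >> 3] & (1 << (i & 7)):            # 2*i + 1 is still marked prime
            p = 2 * i + 1
            for j in range((p * p) // 2, nbits, p):  # strike out odd multiples of p
                bits[j >> 3] &= ~(1 << (j & 7))
    return bits

def is_odd_prime(bits, n):
    """Membership test for an odd n <= limit using the packed array."""
    i = n // 2
    return bool(bits[i >> 3] & (1 << (i & 7)))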

Looking at it from the other direction, there's probably 150 ms of memory access that's unavoidable, and similarly 100 ms of arithmetic that's unavoidable, so I am within a factor of maybe 4 of what the hardware can do. That's pretty good.

To go much faster, one needs a different algorithm. An exhaustive test, even one that works with factors, needs to test almost every number in the range for primality (61% of the odd numbers). That suggests I need to sieve, and sieving already dominates my time. What I am thinking of is:
  • If there is one solution, can I use it to find another solution?
  • Is there some way of generating solutions from "seeds", other numbers from which a solution can be derived?
  • If I exclude a candidate solution, is there some way to infer that another candidate can be excluded?
Otherwise I would be nibbling around the edges.
 
  • #77
Vanadium 50 said:
An exhaustive test, even one that works with factors, needs to test almost every number in the range for primality (61% of the odd numbers).

I'm not sure what you mean by this. The solution I'm using that does factorization (to compare with the combinatorial one) iterates through all numbers equal to 2 modulo 4 in the range (starting with 6); that's 25% of the numbers in the range. It only tries to factor the ones for which ##n + 1## and ##n/2 + 2## are prime, doing a lookup into the pre-computed primes (and the factoring algorithm, besides using the pre-computed list of primes, exits immediately if it finds a repeated prime factor, so it only completely factors the squarefree numbers).
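In outline, that filter looks something like the sketch below (a simplified illustration of the approach as described, with primes_set standing in for the precomputed sieve; the full factoring step is only indicated in the comment):

Python:
def candidates(limit, primes_set):
    # n must be even but not divisible by 4 (if 4 | n, then 2 + n//2 is even),
    # i.e. n = 2 (mod 4): 25% of the range.
    for n in range(6, limit + 1, 4):
        # Cheap pre-filters from the divisor pairs (1, n) and (2, n // 2).
        if (n + 1) in primes_set and (n // 2 + 2) in primes_set:
            # A full solver would now factor n, bailing out on any repeated
            # prime factor, and check the remaining divisor sums.
            yield n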
 
  • #78
What I mean is that so many numbers need to be tested for primality that there is likely little to be gained by stopping the sieve early and using a primality test (e.g. Adleman, Miller-Rabin) on the larger numbers.
 
  • #79
If I recall correctly, they say most Project Euler programs should be able to run in less than 2 minutes. This is where some thinking is necessary to find the 'better way'.

I'm just now reading through this and haven't read through all the suggestions yet, so this may already have been suggested: could you build a set of numbers which have failed the test, and check whether the next candidate is a multiple of one of those numbers?
 
  • #80
One last thought. Most Project Euler questions do not lend themselves well to parallelization, partly by design. This one does.

The improvement is 20%. The reason, as discussed earlier, is that memory bandwidth is a limiting factor; more processes just get us closer to that wall.
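For Python, the analogous split (a sketch only, and subject to the same memory-bandwidth ceiling; solve_range is a placeholder for the real per-range worker) would be to hand disjoint chunks of the candidate range to a process pool:

Python:
from multiprocessing import Pool

def solve_range(bounds):
    lo, hi = bounds
    # Stand-in worker: a real one would sieve/test candidates in [lo, hi)
    # and return the sum of the qualifying n.
    return sum(range(lo, hi))

if __name__ == "__main__":
    N, workers = 10**8, 8
    step = N // workers
    chunks = [(i * step + 1, min((i + 1) * step, N) + 1) for i in range(workers)]
    with Pool(workers) as pool:
        print(sum(pool.map(solve_range, chunks)))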
 
Last edited:
  • #81
Update: I switched from an i5-6500 (4C/4T) to a 3900X (12C/24T). So with 4-10x the computing power, the incremental improvement is... <drum roll>... 40%.
 
