Sum of All Positive Integers n where d+n/d is Prime to 100M

  • Thread starter: Arman777
  • Tags: Euler
Summary
The discussion focuses on solving Project Euler Problem 357, which involves finding the sum of all positive integers n not exceeding 100 million such that for every divisor d of n, the expression d + n/d is prime. Participants share code snippets for generating divisors and testing for primality, noting that the current implementation is slow for larger numbers. Suggestions include using the Sieve of Eratosthenes to generate a set of primes for faster membership testing, as set lookups are significantly quicker than list lookups. Profiling the code is emphasized to identify bottlenecks and optimize performance. The conversation highlights the importance of efficient algorithms in computational problems.
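For orientation, here is a minimal sketch (illustrative only; the names and the small LIMIT below are not from the thread) of the brute-force approach plus the sieve-as-a-set suggestion: precompute the primes once, keep them in a set so membership tests are O(1), and check d + n/d over each divisor pair.

Python:
LIMIT = 1000  # the real problem uses 10**8; kept small here for a quick demo

def prime_set(limit):
    """Sieve of Eratosthenes returning a set (set lookups beat list lookups)."""
    flags = bytearray([1]) * (limit + 1)
    flags[0:2] = b"\x00\x00"
    for p in range(2, int(limit ** 0.5) + 1):
        if flags[p]:
            flags[p * p::p] = bytearray(len(range(p * p, limit + 1, p)))
    return {i for i, f in enumerate(flags) if f}

primes = prime_set(LIMIT + 1)  # d = 1 gives n + 1, so sieve one past the limit

def qualifies(n):
    """True if d + n/d is prime for every divisor d of n (each pair shares one sum)."""
    return all(d + n // d in primes
               for d in range(1, int(n ** 0.5) + 1) if n % d == 0)

print(sum(n for n in range(1, LIMIT + 1) if qualifies(n)))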
  • #61
PeterDonis said:
This is very disappointing and confusing to me, as while Python is expected to be slow, it's not supposed to be that slow, nor is it supposed to invert one's usual intuition about things like, as you put it, factorization is slow but multiplication is fast.
That bugs me too. Since yesterday I have been trying to use Numba and PyPy, but I couldn't find a way to run them. Maybe you can find a way to speed things up using Numba, PyPy, or Cython.
 
Last edited:
  • #62
PeterDonis said:
This is very disappointing and confusing to me, as while Python is expected to be slow, it's not supposed to be that slow, nor is it supposed to invert one's usual intuition about things like, as you put it, factorization is slow but multiplication is fast.

Can you try another implementation of Python? I agree with you that this doesn't look right.

I suspect some inefficiencies in memory access. My reasoning is that running my code on machines with a faster CPU but slower memory slows it down. So if the 100 MB arrays weren't already enough of a tip-off that memory access would be important, the measurements confirm it.
 
  • #63
Vanadium 50 said:
Can you try another implementation of Python?

I don't have any others handy, and anyway I personally am more interested in figuring out what's going on with the standard CPython implementation.

Vanadium 50 said:
I suspect some inefficiencies in memory access.

That's a possibility, yes, but if that's what's going on, I don't see how to fix it, since I'm not doing anything unusual with the Python code.

For example, here is the function that corresponds to your n = 4 case as far as generating the combinations. It only yields 3-tuples because it only iterates over the primes from 3 on (that's what the primes parameter is: a list of those primes); the function that calls this one takes care of multiplying by 2 and testing each particular combination to see if it belongs in the result set. This function, and its counterparts for the other combination sizes, is where my code spends most of its time. (The q parameter is n divided by 2, so it will be 50,000,000 for the n = 10^8 case. The prod function just multiplies together all the items in an iterable and returns the product.)

Python:
from itertools import takewhile
from math import prod  # or an equivalent helper that multiplies all items in an iterable

def _factor_gen_3(primes, q, debug=True):
    # Yield 3-tuples (a, b, c) of distinct odd primes with a * b * c < q.
    for i, a in enumerate(primes[:-2]):
        # Stop once even the two smallest remaining primes push the product past q.
        if a * prod(primes[i + 1:i + 3]) >= q:
            break
        for j, b in enumerate(primes[i + 1:-1]):
            if a * b * prod(primes[i + j + 2:i + j + 3]) >= q:
                break
            for c in takewhile(lambda p: a * b * p < q, primes[i + j + 2:]):
                yield (a, b, c)
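For context, a caller might look roughly like the sketch below. This is not the actual driver from the thread; prod is assumed to be math.prod, primes_set is an assumed set of the sieved primes for fast membership tests, and the divisor check is written naively just to show where the "multiply by 2 and test" step fits.

Python:
from math import prod

def _results_3(primes, primes_set, n):
    """Hypothetical driver for _factor_gen_3 (illustration only)."""
    q = n // 2
    for combo in _factor_gen_3(primes, q):
        cand = 2 * prod(combo)  # reattach the factor of 2 the generator leaves out
        # Naive membership test: every divisor pair (d, cand // d) must give a prime sum.
        if all(d + cand // d in primes_set
               for d in range(1, int(cand ** 0.5) + 1) if cand % d == 0):
            yield cand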
 
  • #64
I don't see why that should be slow. The lambda is a bit unusual, but there's no reason it should be slower than an ordinary function.

Maybe a clue would be to try other values of n (it's 100,000,000 in the question as asked) and try to infer the computational complexity. Is the slowness ##O(n)##, ##O(n^2)##, or something else?
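One way to check (a rough timing sketch, assuming the _factor_gen_3 generator from the earlier post; sieve_list here is just a stand-in for however the prime list is actually precomputed) is to time the combination generation at several values of n and watch how the ratios grow:

Python:
import time

def sieve_list(limit):
    """Return the odd primes up to limit (a simple stand-in for the precomputed list)."""
    flags = bytearray([1]) * (limit + 1)
    flags[0:2] = b"\x00\x00"
    for p in range(2, int(limit ** 0.5) + 1):
        if flags[p]:
            flags[p * p::p] = bytearray(len(range(p * p, limit + 1, p)))
    return [i for i in range(3, limit + 1) if flags[i]]

# _factor_gen_3 is the generator from post #63.
for n in (10**6, 10**7, 10**8):
    primes = sieve_list(n)
    start = time.perf_counter()
    count = sum(1 for _ in _factor_gen_3(primes, n // 2))
    elapsed = time.perf_counter() - start
    print(f"n = {n:>11,}: {count:,} tuples in {elapsed:.2f} s")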
 
  • #65
Vanadium 50 said:
The lambda is a bit unusual

It's just easier than having to define a separate function for something so simple that only appears as a predicate in takewhile. As you say, it shouldn't make any difference to speed.

Vanadium 50 said:
Is the slowness ##O(n)##, ##O(n^2)## or something else?

Roughly quadratic from the comparisons I've been able to make.
 
  • #66
PeterDonis said:
Roughly quadratic from the comparisons I've been able to make.

Well, the sieve itself is ##O(n \log \log n)##. So if you are seeing quadratic behavior, it's not the algorithm: it's that executing the Python internals is taking longer than it should. Again, I would guess memory, just because it's easy to imagine how the second memory allocation might take more than the first.
 
  • #67
Vanadium 50 said:
the sieve itself is ##O(n \log \log n)##. So if you are seeing quadratic behavior, it's not the algorithm

I'm not counting the time it takes to pre-compute the list of primes using the sieve; I'm just talking about the time it takes to generate the combinations of prime factors once the list of primes is pre-computed.

Since the combination generator functions use nested for loops, we would expect them to be quadratic or worse if it weren't for the loop truncations. (After all, a straight count of ##r##-element combinations taken from ##N## elements total goes like ##N^r## for large ##N##.) But the loop truncations should limit the number of combinations to roughly linear in ##N##; at least I think so, since basically we're just generating all squarefree numbers less than ##N##, and the count of those is roughly linear in ##N##.
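As a quick sanity check on that claim (a throwaway sketch, not part of either solution), one can count the squarefree numbers up to ##N## directly and compare against the asymptotic density ##6/\pi^2 \approx 0.608##:

Python:
from math import pi

def count_squarefree(limit):
    # Sieve out every multiple of p*p; whatever survives is squarefree.
    flags = bytearray([1]) * (limit + 1)
    p = 2
    while p * p <= limit:
        flags[p * p::p * p] = bytearray(len(range(p * p, limit + 1, p * p)))
        p += 1
    return sum(flags[1:])  # count indices 1..limit

N = 10**6
print(count_squarefree(N), round(6 / pi**2 * N))  # the two counts agree to within a few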
 
  • #68
PeterDonis said:
loop truncations should limit the number of combinations to roughly linear in N

Right. ##\frac{6}{\pi^2}N## if you want to be precise. But fundamentally it's got to be ##O(N)##, since you are testing (most of) the numbers, just in a different order.

Same point as above, though. The algorithm can't be doing this to you. It's got to be something internal to your Python implementation.
 
  • #69
PeterDonis said:
Roughly quadratic from the comparisons I've been able to make.

After further testing, it looks closer to ##N \log N##.

Vanadium 50 said:
The algorithm can't be doing this to you. It's got to be something internal to your Python implementation.

I'm afraid so, yes.
 
  • #70
If it's ##N \log N##, I wonder if it allocates memory in non-contiguous blocks: perhaps a B-tree. The ##N \log N## would come from traversing the tree rather than just incrementing a counter.

Python does garbage collection automatically, right? If so, maybe it's a side effect of keeping track of objects that gets overwhelmed by these giant arrays. That would also explain why I don't see it; I made C++ give me contiguous arrays.
 
  • #71
Vanadium 50 said:
I wonder if it allocates memory in non-contiguous blocks

It could, although I'm not sure how that would affect the performance of simple nested for loops over slices of a list that's not being mutated. All of the variable accesses at the C level inside the interpreter should be just pointers.

Vanadium 50 said:
Python does garbage collection automatically, right?

Yes. But disabling automatic gc doesn't affect the execution time, so this doesn't seem to be the issue. (Mostly Python deallocates objects based on their reference count going to 0; the automatic gc is for cases that can't be handled that simply, like circular references. I don't think there are any of those in my code.)
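For reference, that gc check can be done with the standard gc module; here is a minimal sketch, where work() is just a stand-in for whatever is actually being measured (e.g. exhausting the combination generator):

Python:
import gc
import time

def work():
    # Stand-in for the real measurement target; replace with the call being profiled.
    return sum(i * i for i in range(10**6))

def timed(fn):
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start

gc.disable()
try:
    no_gc = timed(work)
finally:
    gc.enable()
with_gc = timed(work)
print(f"gc disabled: {no_gc:.3f} s, gc enabled: {with_gc:.3f} s")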
 
  • #72
Vanadium 50 said:
perhaps a B-tree. The ##N \log N## would come from traversing the tree

If you mean the interpreter has to do this when allocating small blocks of memory for new objects, yes, this is one of my prime suspects at this point, since one thing the for loops can't avoid doing is allocating new objects on every iteration to do the tests and yield the combinations.
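A crude way to get a feel for that per-iteration allocation cost (a standalone micro-benchmark, unrelated to the solver itself) is to time a loop that allocates a fresh tuple on every pass against one that does not:

Python:
import timeit

def with_allocation(n=10**6):
    for i in range(n):
        t = (i, i + 1, i + 2)  # a fresh tuple (plus two new ints) every iteration

def without_allocation(n=10**6):
    for i in range(n):
        t = i  # no extra allocation beyond the loop itself

print("tuples:   ", timeit.timeit(with_allocation, number=10))
print("no tuples:", timeit.timeit(without_allocation, number=10))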
 
  • #73
I am actually thinking that if it has very small memory blocks, it needs to figure out where everything is even when just reading. But what I can't figure out is why anyone would do this on a modern CPU.
 
  • #74
Vanadium 50 said:
I am actually thinking that if it has very small memory blocks, it needs to figure out where everything is even when just reading.

Python objects at the C level are just pointers to data structures. There are variable name lookups, which in the general case are hash table lookups, but for local variables inside functions those are optimized to array accesses.
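That difference is easy to see with the standard dis module: a local variable compiles to LOAD_FAST (an index into a C array of the frame's locals), while a global compiles to LOAD_GLOBAL (a dict lookup). A small illustration:

Python:
import dis

GLOBAL_X = 1

def local_access():
    x = 1
    return x         # compiles to LOAD_FAST: indexed slot in the frame

def global_access():
    return GLOBAL_X  # compiles to LOAD_GLOBAL: name lookup in the globals dict

dis.dis(local_access)
dis.dis(global_access)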
 
  • #75
OK, I give up. But since the behavior is crazy, any explanation of it is bound to be a little crazy too.
 
  • #76
Some final thoughts:

Can I do better than one second? Maybe a little. As I said, memory access is the bottleneck. A bool is a byte, so I need 100 MB to store my prime array; with clever packing, I can get a factor of at least 16. But how much can I speed up? Well, the prime-finding is 800 ms, and about half of that is memory access, and maybe I can double the speed of that, so there's 200 ms to be gained. About 20%. There are slightly faster sieves, so maybe I can get a 30% savings, but probably not much more.
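In Python terms, the packing might look something like the sketch below: one bit per odd number instead of one byte per number, which is where a factor of 16 comes from. (This is only an illustration of the storage scheme, not the actual C++ implementation being described.)

Python:
def packed_odd_sieve(limit):
    """Bit-packed sieve: bit i of the array represents the odd number 2*i + 1."""
    nbits = (limit + 1) // 2
    bits = bytearray(b"\xff") * ((nbits + 7) // 8)  # 1 = "assume prime"
    bits[0] &= ~1  # 1 is not prime
    for i in range(1, (int(limit ** 0.5) + 1) // 2 + 1):
        if bits[i >> 3] & (1 << (i & 7)):            # 2*i + 1 is still marked prime
            p = 2 * i + 1
            for j in range((p * p) // 2, nbits, p):  # strike out odd multiples of p
                bits[j >> 3] &= ~(1 << (j & 7))
    return bits

def is_odd_prime(bits, n):
    """Membership test for an odd n <= limit using the packed array."""
    i = n // 2
    return bool(bits[i >> 3] & (1 << (i & 7)))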

Looking at it from the other direction, there's probably 150 ms of memory access that's unavoidable, and similarly 100 ms of arithmetic that's unavoidable, so I am within a factor of maybe 4 of what the hardware can do. That's pretty good.

To go much faster, one needs a different algorithm. An exhaustive test, even one that works with factors, needs to test almost every number in the range for primality (61% of the odd numbers). That suggests I need to sieve, and sieving already dominates my time. What I am thinking of is:
  • If there is one solution, can I use it to find another solution?
  • Is there some way of generating solutions from "seeds", other numbers from which a solution can be derived?
  • If I exclude a candidate solution, is there some way to infer that another candidate can be excluded?
Otherwise I would be nibbling around the edges.
 
  • #77
Vanadium 50 said:
An exhaustive test, even one that works with factors, needs to test almost every number in the range for primality (61% of the odd numbers).

I'm not sure what you mean by this. The solution I'm using that does factorization (to compare with the combinatorial one) iterates through all numbers equal to 2 modulo 4 in the range (starting with 6); that's 25% of the numbers in the range. It only tries to factor the ones for which ##n + 1## and ##n/2 + 2## are prime, doing a lookup into the pre-computed primes (and the factoring algorithm, besides using the pre-computed list of primes, exits immediately if it finds a repeated prime factor, so it only completely factors the squarefree numbers).
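In outline, that filter looks something like the sketch below (a simplified illustration of the approach as described, with primes_set standing in for the precomputed sieve; the full factoring step is only indicated in the comment):

Python:
def candidates(limit, primes_set):
    # n must be even but not divisible by 4 (if 4 | n, then 2 + n//2 is even),
    # i.e. n = 2 (mod 4): 25% of the range.
    for n in range(6, limit + 1, 4):
        # Cheap pre-filters from the divisor pairs (1, n) and (2, n // 2).
        if (n + 1) in primes_set and (n // 2 + 2) in primes_set:
            # A full solver would now factor n, bailing out on any repeated
            # prime factor, and check the remaining divisor sums.
            yield n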
 
  • #78
What I mean is that so many numbers need to be tested for primality that there is likely little to be gained by stopping the sieve early and using a primality test (e.g. Adleman, Miller-Rabin) on the larger numbers.
 
  • #79
If I recall correctly, they say most Project Euler programs should be able to run in less than 2 minutes. This is where some thinking is necessary to find the 'better way'.

I'm just now reading through this and haven't read through all the suggestions yet, so this may already have been suggested: could you build a set of numbers which have failed the test, and check whether the next candidate is a multiple of one of those numbers?
 
  • #80
One last thought. Most Project Euler questions do not lend themselves well to parallelization, partly by design. This one does.

The improvement is 20%. The reason, as discussed earlier, is that memory bandwidth is a limiting factor; more processes just get us closer to that wall.
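For Python, the analogous split (a sketch only, and subject to the same memory-bandwidth ceiling; solve_range is a placeholder for the real per-range worker) would be to hand disjoint chunks of the candidate range to a process pool:

Python:
from multiprocessing import Pool

def solve_range(bounds):
    lo, hi = bounds
    # Stand-in worker: a real one would sieve/test candidates in [lo, hi)
    # and return the sum of the qualifying n.
    return sum(range(lo, hi))

if __name__ == "__main__":
    N, workers = 10**8, 8
    step = N // workers
    chunks = [(i * step + 1, min((i + 1) * step, N) + 1) for i in range(workers)]
    with Pool(workers) as pool:
        print(sum(pool.map(solve_range, chunks)))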
 
Last edited:
  • #81
Update: I switched from an i5-6500 (4C/4T) to a 3900X (12C/24T). So with 4-10x the computing power, the incremental improvement is... <drum roll>... 40%.
 
