Convert a FOR loop to parallel processes

In summary, the conversation discusses various ways to speed up code that takes days to run. The main suggestion is to use vectorization techniques in NumPy to minimize the involvement of the interpreter and achieve faster speeds. Other options mentioned include using Numba or Cython, or switching to Julia, which has packages that support distributed computing. The possibility of implementing the code in C++ is also discussed. The conversation ends with the suggestion of learning how to get at the underlying Python array structure and writing the code in Fortran or C to call it from within Python.
  • #1
EngWiPy
Hi, I know that Python supports parallel programming. I think it can be used in my case, but I am not sure how. Basically, I have this for loop

Python:
out = []
for value in vec:
    out.append(func(value))

where func is some function that takes value as an argument. The different values in vec are independent, and can be used in parallel to calculate func(value). I am relatively new to Python, so how can I convert this simple for loop into code that processes the different independent values in parallel to speed up the results?

Thanks in advance
 
  • #2
I think you're putting the cart before the horse here. If you are running a vanilla for loop in regular Python and/or using built-in lists, you are going way slower than you need to. If you told me that this was part of code running under Cython, Numba, or maybe in PyPy, I would feel differently. Even if you weren't using any of those, you could do most or all of this with broadcasting in NumPy, with a speedup of possibly 100x or more.

Parallel processing in Python is not very easy, whereas the stuff mentioned above is relatively straightforward.
 
  • #3
I am running code that takes days. I want to speed it up, and I thought maybe parallel processing is the way to go, but I wasn't certain. How will broadcasting speed up the code 100x? Could you elaborate?
 
  • #4
EngWiPy said:
I am running code that takes days. I want to speed it up, and I thought maybe parallel processing is the way to go, but I wasn't certain. How will broadcasting speed up the code 100x? Could you elaborate?
That was my read on this.

For numerics, I really like using Numpy. Broadcasting or more generally "vectorizing" code pushes it through numpy and gives you, in essence, C speeds.

Vectorization techniques are pretty standard (and old) for interpreted languages. The basic idea is to minimize the involvement of the interpreter. You may want to watch the 'Losing Your Loops' talk on fast numerical computing with NumPy to get started.
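
To make this concrete, here is a minimal sketch of what vectorizing the loop from post #1 could look like, assuming func can be rewritten in terms of NumPy's elementwise operations (the func below is just a hypothetical placeholder):

Python:
import numpy as np

# Hypothetical stand-in for the real per-element computation.
# Built from NumPy ufuncs, so calling it on an array applies it
# elementwise in compiled code -- the interpreter is not involved
# for each element.
def func(x):
    return np.exp(-x) * np.sin(x)

vec = np.linspace(0.0, 10.0, 1_000_000)
out = func(vec)  # one call replaces the entire Python-level loop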


- - - -
The alternatives are:

Numba plays well with NumPy but not much else, and takes some skill to wield -- I like to augment my code with Numba, though doing so can make the whole vectorization technique unnecessary.

Cython is one step in between C and Python -- fast, but uglier and clunkier.

These days, where possible, I'd skip all this and just do the code in Julia (whose syntax will remind you of MATLAB and python).
 
  • #5
I think he's referring to a distributed computing scheme where your tasks are distributed across several computers and then aggregated together when complete. Apache Spark, as an example, can do this sort of thing.

Yes, Julia is a great way to go, as it has packages that support distributed computing and works well with Python too. See the site

https://lectures.quantecon.org/jl/index.html

for more on this stuff.
 
  • #6
@StoneTemplePython Thanks. I will check the video out, but for now, how can I use vectorization in my for loop example to speed it up?

@jedishrfu Yes, I want to distribute the process, but not on multiple computers, as I don't have the resources; rather, on multiple threads (I think the term is multithreading, or multiprocessing in other languages like C++). I don't have to wait for the result of iteration k to start iteration k+1. I can do all of them in parallel, and then combine the results in the out list.

I could run multiple Jupyter notebooks (or any other editors), each with one point/value from the for loop, and then combine the results manually at the end, but I don't think this is the most efficient way.

Another question: if I implemented the for loop in C++ as it is in Python, would the process be (noticeably) faster?
 
  • #7
EngWiPy said:
@StoneTemplePython Thanks. I will check the video out, but for now, how can I use vectorization in my for loop example to speed it up?
I'm not sure what else to tell you here -- it's a technique that takes some work to get good at. The nice thing about Python is there are tons of fully worked examples in blogs and on Stack Overflow.

Matlab users historically had to get good at this technique too -- I'm a touch surprised you're not familiar with it. In very recent years, I think, Matlab has added some features to override slow interpreter-driven performance (i.e. to compete with the likes of Julia) -- specifically, the term I'm thinking of is Just-in-Time compilation.
EngWiPy said:
Another question: if I implemented the for loop in C++ as it is in Python, would the process be (noticeably) faster?
very much
 
  • #8
StoneTemplePython said:
I'm not sure what else to tell you here -- it's a technique that takes some work to get good at. The nice thing about Python is there are tons of fully worked examples in blogs and on Stack Overflow.

Matlab users historically had to get good at this technique too -- I'm a touch surprised you're not familiar with it. In very recent years, I think, Matlab has added some features to override slow interpreter-driven performance (i.e. to compete with the likes of Julia).
very much

I am familiar with vectorization in built-in functions in MATLAB like sum(1:10), but I haven't implemented a user-defined function that processes a vector (I don't know the underlying processes that make sum(1:10) faster than successively adding the numbers between 1 and 10 in a for loop). I have to look this up. I may also consider switching to C++ if it is much faster than Python. Speed is important in my case. Thanks
 
  • #9
EngWiPy said:
I am familiar with vectorization in built-in functions in MATLAB like sum(1:10), but I haven't implemented a user-defined function that processes a vector (I don't know the underlying processes that make sum(1:10) faster than successively adding the numbers between 1 and 10 in a for loop). I have to look this up. I may also consider switching to C++ if it is much faster than Python. Speed is important in my case. Thanks

If you're ok with the tedium, doing it directly in C++ should blow (most) everything else out of the water with respect to minimizing run-times...
 
  • #10
StoneTemplePython said:
If you're ok with the tedium, doing it directly in C++ should blow (most) everything else out of the water with respect to minimizing run-times...

I will give it a try.
 
  • #11
A multithreaded approach only works when each thread has IO to do; otherwise it would just be switching from thread to thread.

The idea is to keep the CPU running while other hardware handles things in the background.

A simple example is to imagine yourself as a chess master/CPU playing several games/threads at the same time. Once you make a move/IO you switch to the next game/thread while your opponent ponders what to do next. In that way you are thinking constantly (high CPU usage) and can juggle games efficiently.
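
One caveat specific to CPython: the GIL means threads mostly help with IO-bound work; for a CPU-bound loop like the one in post #1 you would use separate processes instead, e.g. via the standard multiprocessing module. A minimal sketch, assuming func is a picklable top-level function (the func below is just a placeholder):

Python:
from multiprocessing import Pool

# Hypothetical stand-in for the real per-element computation.
def func(value):
    return value * value

if __name__ == "__main__":
    vec = range(1_000_000)
    # Each worker process evaluates func on a chunk of vec;
    # the results come back in the original order.
    with Pool() as pool:
        out = pool.map(func, vec)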
 
  • #12
jedishrfu said:
A multithreaded approach only works when each thread has IO to do; otherwise it would just be switching from thread to thread.
I'm not sure this is necessarily true in this era of processors with multiple cores. If you can break up the operations into separate threads, each running on its own core, you can greatly speed up the calculations. Modern C++ (i.e., C++11 or C++14) has both low-level and high-level thread operations for parallelizing a program.
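
As a minimal sketch of the high-level route, C++11's std::async can farm out independent evaluations (func below is a hypothetical placeholder; one task per element only makes sense when each call does substantial work -- for millions of tiny calls you would batch them):

Code:
#include <cmath>
#include <future>
#include <vector>

// Hypothetical stand-in for the real per-element work.
double func(double u) { return std::exp(-u) * std::sin(u); }

std::vector<double> run_parallel(const std::vector<double>& vec) {
    // Launch each evaluation as an asynchronous task; the runtime
    // schedules the tasks across the available cores.
    std::vector<std::future<double>> futures;
    futures.reserve(vec.size());
    for (double v : vec)
        futures.push_back(std::async(std::launch::async, func, v));

    std::vector<double> out;
    out.reserve(vec.size());
    for (auto& f : futures)
        out.push_back(f.get());  // blocks until that task finishes
    return out;
}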
 
  • #13
Unbelievable. I ran the same code in Python and C++. The C++ code took around 8 seconds to finish, while the same code in Python took more than 240 seconds! That is a speed factor of 30! C++ still takes time for the hard cases, but this is a good sign that it will finish faster. I am sure I can speed it up further with more efficient programming. I will investigate that if the hard cases still take days.
 
  • #14
EngWiPy said:
Unbelievable. I ran the same code in Python and C++. The C++ code took around 8 seconds to finish, while the same code in Python took more than 240 seconds! That is a speed factor of 30! C++ still takes time for the hard cases, but this is a good sign that it will finish faster. I am sure I can speed it up further with more efficient programming. I will investigate that if the hard cases still take days.

Don't be too surprised -- if you want speed in Python, you need to use some of the tricks / techniques I mentioned earlier or push everything down to built-in functions from libraries -- almost all of those are actually coded in C or C++ (for very good reason)

Based on some simulations and number crunching I've done, I think a ##\gt 100x## multiple is not uncommon at all

That 'lose your loops' video will give you a flavor of why this happens
 
  • #15
I have never found C++ difficult to learn (I haven't practiced it regularly, though, so I had to look up some syntax, which was quick), but the de facto tool in our field is MATLAB. I think MATLAB and Python are better than C++ when it comes to visualization, which is not an issue for me, as I can copy and paste the results from the C++ program into a MATLAB file and plot them there.
 
  • #16
Mark44 said:
I'm not sure this is necessarily true in this era of processors with multiple cores. If you can break up the operations into separate threads, each running on its own core, you can greatly speed up the calculations. Modern C++ (i.e., C++11 or C++14) has both low-level and high-level thread operations for parallelizing a program.
Yes, you are correct. Multi-core changes the game somewhat: it's a mashup of distributed and multithreaded computing, with shared memory access and no network bandwidth issues.
 
  • #17
StoneTemplePython said:
Don't be too surprised -- if you want speed in Python, you need to use some of the tricks / techniques I mentioned earlier or push everything down to built-in functions from libraries -- almost all of those are actually coded in C or C++ (for very good reason)

Based on some simulations and number crunching I've done, I think a ##\gt 100x## multiple is not uncommon at all

That 'lose your loops' video will give you a flavor of why this happens

I am trying to evaluate a numerical integration, and I need to make the step size very small to get an accurate evaluation. With a step size of 0.01 there was a huge difference in relative terms, but it didn't feel significant because Python took only about 4 minutes to finish. But when I changed the step size to 0.001, the C++ program finished in 4 minutes, while the Python program was still running after more than 45 minutes. If the 30x factor holds, the Python program should take about 2 hours in total! I am definitely switching to C++. I am glad I opened this thread and came to this realization.
 
  • #18
jedishrfu said:
Multi threaded approach only works when each thread has IO to do otherwise it would be just switching from thread to thread.

The idea is to keep the CPU running while other hardware handles things in the background.

A simple example is to imagine yourself as a chess master/CPU playing several games/threads at the same time. Once you make a move/IO you switch to the next game/thread while your opponent ponders what to do next. In that way you are thinking constantly high CPU usage and can juggle games efficiently.

I am not sure I get this, but the idea is: since you have to evaluate the same function on different values that don't depend on each other, why can't you do it in parallel? Multicore CPUs were on my mind when I asked the question, because if you have one CPU and you rotate in round-robin fashion over different processes, executing a part of each process each time instead of finishing a whole process before moving to the next, the end result in terms of time would be the same.
 
  • #19
EngWiPy said:
I am not sure I get this, but the idea is: since you have to evaluate the same function on different values that don't depend on each other, why can't you do it in parallel? Multicore CPUs were on my mind when I asked the question, because if you have one CPU and you rotate in round-robin fashion over different processes, executing a part of each process each time instead of finishing a whole process before moving to the next, the end result in terms of time would be the same.

With a single CPU there is no parallelism; there is instead thread switching. IO operations are a convenient time to do a thread switch or a task switch, since the thread can't continue until the data is read or written. When it's just computation, the CPU decides how much time to give a thread before it switches to the next one.

A related idea is shared data, where threads block if the executing thread is updating the shared item.

Multicore processors speed the threading strategy up by minimizing the thread-switching overhead and the shared-memory blocking, as well as by offering true parallel operation on each core.

Here's a better description of how cores work, using cached memory and other shared resources:

https://en.m.wikipedia.org/wiki/Multi-core_processor
 
  • #20
jedishrfu said:
With a single CPU there is no parallelism

This is the key. The corollary is that if you have a 4-core CPU, a single-threaded program will use only 25% of the available CPU, so there is a potential factor-of-4 gain, but only a factor of 4.

If it were me, I'd see what C/C++ gains I could get, what algorithmic gains I could get, and then when I was done, go after the factor of 4 with OpenMP.
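
For reference, here is a minimal sketch of that last step: the loop from post #1 written in C++ with an OpenMP pragma (func is a hypothetical placeholder; build with -fopenmp on GCC or Clang):

Code:
#include <cmath>
#include <cstdio>
#include <vector>

// Hypothetical stand-in for the real per-element work.
double func(double u) { return std::exp(-u) * std::sin(u); }

int main() {
    std::vector<double> vec(1000000);
    for (std::size_t i = 0; i < vec.size(); ++i)
        vec[i] = i * 1e-6;

    std::vector<double> out(vec.size());

    // The iterations are independent, so OpenMP can split the
    // index range across all available cores.
    #pragma omp parallel for
    for (long i = 0; i < (long)vec.size(); ++i)
        out[i] = func(vec[i]);

    std::printf("out[0] = %f\n", out[0]);
    return 0;
}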
 
  • #21
EngWiPy said:
I am familiar with vectorization in built-in functions in MATLAB like sum(1:10), but I haven't implemented a user-defined function that processes a vector (I don't know the underlying processes that make sum(1:10) faster than successively adding the numbers between 1 and 10 in a for loop). I have to look this up. I may also consider switching to C++ if it is much faster than Python. Speed is important in my case. Thanks
What matters most is the overhead of Python. Python lists can contain different types, and the interpreter has to check the type every time a list element is accessed. The list elements are also pointers to separately allocated memory.
A NumPy array has only one pointer to all of the data, and all elements of the array are the same type. There is still a lot of overhead if you use small arrays.

I did a few speed tests.
Time is in nanoseconds/iteration
Code:
array size   1000    10000    100000    1000000

python         70       70        70         71
numpy           4      0.9       0.7        1.3
c            0.45     0.46      0.53        0.7

Python is a single for loop
Code:
total = 0
for i in range(len(a)):
    total += a[i]
return total

This takes 70 ns for every iteration, independent of how big the array is. That is about 160 clock cycles on my laptop, and the processor can do up to 4 instructions per clock cycle.

numpy just calls numpy.sum() on a NumPy array. It will run 50 to 70 times faster for large arrays. Smaller arrays still have a lot of Python overhead. Even for an array of size 1000, the Python overhead will take more time than doing the work. For really large arrays, the cost of memory reads that no longer fit in the caches becomes important.
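
(For completeness, the NumPy version of the benchmark is just the following; the summation loop runs in compiled code inside NumPy:)

Python:
import numpy as np

a = np.arange(1_000_000, dtype=np.float64)
total = np.sum(a)  # a single call; no per-element interpreter work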

The C program needs some tricks to go at maximum speed.
The floating-point adder of Intel processors (since at least the Pentium) can start a new add every clock cycle, but there is a delay of 4 cycles until the output becomes available. If you use the same loop as in Python, there will be a 4-clock-cycle delay for every iteration, because every iteration depends on the previous one. You can, however, collect the sum in 4 different variables, and the loop will run 4 times as fast. 0.45 ns/iteration is nearly 1 clock cycle/iteration on my 2.3 GHz processor.
Code:
for (int cnt = 0; cnt < REP; cnt++) {
    // Four independent accumulators break the dependency chain
    // between iterations, hiding the 4-cycle latency of the adder.
    double sum1 = 0;
    double sum2 = 0;
    double sum3 = 0;
    double sum4 = 0;

    for (int i = 0; i < SZ; i += 4) {
        sum1 += a[i];
        sum2 += a[i + 1];
        sum3 += a[i + 2];
        sum4 += a[i + 3];
    }
    double sum = sum1 + sum2 + sum3 + sum4;
}

 
  • #22
Thanks for the technical details. Python is easy to use and more user-friendly, but at the cost of more overhead. Are you saying that NumPy in Python can compete with C++ speed-wise? I am not sure whether C++ supports vectorization or something similar.

I can see how dividing the summation can make a difference in the C++ code, but what if inside the for loop you call a function whose result doesn't add to the results of other iterations? How would you speed up that kind of for loop? Something like this:

Code:
float vec[4] = {0.2f, 0.21f, 0.3f, 0.5f};
for (int i = 0; i < 4; i++) {
    cout << func(vec[i]) << endl;
}

float func(float u) {
    return /* some operations */;
}
 
  • #23
EngWiPy said:
Thanks for the technical details. Python is easy to use and more user-friendly, but at the cost of more overhead. Are you saying that NumPy in Python can compete with C++ speed-wise? I am not sure whether C++ supports vectorization or something similar.
C++ compilers can auto-vectorize loops, and sometimes use SSE instructions to do more than one addition at a time with one instruction.
You'll need to avoid data dependencies (such as using a single variable for the sum) and pointer aliasing, and you will need to make func an inline function. Much will depend on the compiler and options.
It also will be very hard work. Using numpy is much easier.
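
For example, a loop like the one below is a good candidate for auto-vectorization when built with optimization enabled (the flags shown are for GCC; Clang reports via -Rpass=loop-vectorize; the scale function is a made-up example):

Code:
// Build with e.g.: g++ -O3 -march=native -fopt-info-vec loop.cpp
// -O3 enables the auto-vectorizer; -fopt-info-vec reports which
// loops were actually vectorized.
#include <vector>

// Inline, with no data dependencies between iterations, so the
// compiler can map it onto SSE/AVX instructions, several
// elements per instruction.
inline float scale(float u) { return 2.0f * u + 1.0f; }

void transform(std::vector<float>& v) {
    for (std::size_t i = 0; i < v.size(); ++i)
        v[i] = scale(v[i]);
}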

EngWiPy said:
I can see how dividing the summation can make a difference in the C++ code, but what if inside the for loop you call a function whose result doesn't add to the results of other iterations? How would you speed up that kind of for loop? Something like this:
Code:
float vec[4] = {0.2f, 0.21f, 0.3f, 0.5f};
for (int i = 0; i < 4; i++) {
    cout << func(vec[i]) << endl;
}
It depends on what that function is. Divisions, square roots, or hard-to-predict conditional jumps are bad. Table lookups are good (if the table fits in the cache).
If the iterations are completely independent, then the processor can work on more than one iteration at the same time, even with only one core.
 
  • #24
EngWiPy said:
I think MATLAB and Python are better than C++ when it comes to visualization
Well, it depends on what you mean by 'better'. From C++ you can basically use a lot of libraries that python uses for visualization.

In the worst case scenario, you can write the results into a file and use gnuplot for displaying, but you can also use some library for visualization. I like VTK for that: https://www.vtk.org/

You can do a lot of things with it, from 2D charts (example on my blog: https://compphys.go.ro/empirical-pseudopotential/ ) to volumetric visualization (another example: https://compphys.go.ro/dft-for-a-quantum-dot/ ).
 
  • #25
willem2 said:
...
If the iterations are completely independent, then the processor can work on more than one iteration at the same time, even with only one core.

Are you talking about multithreading?
 
  • #26
aaroman said:
Well, it depends on what you mean by 'better'. From C++ you can basically use a lot of libraries that python uses for visualization.

In the worst case scenario, you can write the results into a file and use gnuplot for displaying, but you can also use some library for visualization. I like VTK for that: https://www.vtk.org/

You can do a lot of things with it, from 2D charts (example on my blog: https://compphys.go.ro/empirical-pseudopotential/ ) to volumetric visualization (another example: https://compphys.go.ro/dft-for-a-quantum-dot/ ).

By better, I meant you have more flexibility and it is easier to use. Maybe I am wrong, because I am not very familiar with C++ and its full capabilities and libraries.
 
  • #27
EngWiPy said:
By better, I meant you have more flexibility and it is easier to use. Maybe I am wrong, because I am not very familiar with C++ and its full capabilities and libraries.
But as you're finding out, more flexibility and ease of use come at a price -- much slower execution times. C++ doesn't come with a lot of libraries, particularly for graphics and such, but there are third party libraries, such as the ones mentioned by @aaroman, that you can use to obtain images from the data your program creates.
 
  • #28
EngWiPy said:
Are you talking about multithreading?
No, just pipelining and executing more than one instruction at a time. If all the instructions are independent, a single CPU can do many things at once. If you execute something like
for (i = 0; i < 1000; i++) A[i] = A[i] - B[i] * c; the CPU can have dozens of instructions waiting for other instructions or memory in the pipeline at the same time. The CPU will have no problem starting a multiply, a subtraction, a load (2 loads for the newest types) and a store in one clock cycle, even if these belong to different iterations of the loop.
 
  • #29
Mark44 said:
But as you're finding out, more flexibility and ease of use come at a price -- much slower execution times. C++ doesn't come with a lot of libraries, particularly for graphics and such, but there are third party libraries, such as the ones mentioned by @aaroman, that you can use to obtain images from the data your program creates.

In plotting there is usually no computational processing, and Python and MATLAB do it instantly if you have the vectors you want to plot, at least in my case. I can appreciate that C++ is much faster when it comes to reducing computation time, where it can cut the execution time from days to hours, or from hours to minutes. In that case it is worth exploring C++ more, because there is a real and significant advantage. I am not saying that C++ cannot do a good job in visualization, but in my case, I don't need to venture into that territory.
 
  • #30
willem2 said:
No, just pipelining and executing more than one instruction at a time. If all the instructions are independent, a single CPU can do many things at once. If you execute something like
for (i = 0; i < 1000; i++) A[i] = A[i] - B[i] * c; the CPU can have dozens of instructions waiting for other instructions or memory in the pipeline at the same time. The CPU will have no problem starting a multiply, a subtraction, a load (2 loads for the newest types) and a store in one clock cycle, even if these belong to different iterations of the loop.

OK, I see. But we have no control over this. I mean, this is a matter of hardware architecture -- how the CPU executes different instructions, because they are handled by different units. But what if you are executing the same function on independent data, using the same instructions?
 
  • #31
willem2 said:
The CPU can have dozens of instructions waiting for other instructions or memory in the pipeline at the same time. The CPU will have no problem starting a multiply, a subtraction, a load (2 loads for the newest types) and a store in one clock cycle, even if these belong to different iterations of the loop.
EngWiPy said:
OK, I see. But we have no control over this. I mean, this is a matter of hardware architecture -- how the CPU executes different instructions, because they are handled by different units. But what if you are executing the same function on independent data, using the same instructions?
We have some control over this, which is what optimizations using loop unrolling or loop unwinding are about.
The processor executes instructions in one or more pipelines, and tries to guess what the next instruction will be. If it guesses correctly, everything is fine, since that instruction is in the pipeline. If it guesses wrong, it has to flush the pipeline, which takes several clock cycles to refill.

Loops such as for or while loops can be problematic, as are branches such as if and if ... else.

Here's a simple example from this wiki article: https://en.wikipedia.org/wiki/Loop_unrolling
C:
int x;
for (x = 0; x < 100; x++)
{
    delete(x);
}

The same loop, after unrolling:
C:
int x;
for (x = 0; x < 100; x += 5)
{
    delete(x);
    delete(x + 1);
    delete(x + 2);
    delete(x + 3);
    delete(x + 4);
}
 

1. What is the purpose of converting a FOR loop to parallel processes?

The purpose of converting a FOR loop to parallel processes is to improve the efficiency and speed of the loop. By dividing the loop into smaller tasks that can be executed simultaneously, the overall execution time is reduced, resulting in faster processing of data.

2. How does parallel processing differ from a traditional FOR loop?

In a traditional FOR loop, each iteration is executed sequentially, meaning that the loop must complete one iteration before moving on to the next. In parallel processing, multiple iterations can be executed at the same time, allowing for faster execution and improved performance.

3. What are the potential benefits of converting a FOR loop to parallel processes?

The main benefit of converting a FOR loop to parallel processes is improved performance and efficiency. This can be especially beneficial when dealing with large datasets or complex calculations, where parallel processing can substantially reduce the overall execution time of a program.

4. Are there any limitations to using parallel processes instead of a FOR loop?

Yes, there are some limitations to using parallel processes. One limitation is that not all tasks can be easily divided into smaller, parallel tasks. Additionally, there may be added complexity in managing and coordinating the parallel processes, which can make debugging more difficult.

5. What are some common techniques for converting a FOR loop to parallel processes?

There are several techniques for converting a FOR loop to parallel processes, including using multithreading, multiprocessing, and distributed computing. Each technique has its own advantages and considerations, so it is important to carefully evaluate which approach is best suited for a particular task or program.
