# Convert a FOR loop to parallel processes

• Python
Hi, I know that Python supports parallel programming. I think it can be used in my case but not sure how. Basically I have this for loop

Python:
out = []
for value in vec:
out.append(func(value))

where func is some function that takes value as an argument. The different values in vec are independent, and can be used in parallel to calculate func(value). I am relatively new to Python, so, how can I convert this simple for loop to a code that will process the different independent values in parallel to speed up the results?

StoneTemplePython
Gold Member
I think you're putting the cart in front of the horse here. If you are running a vanilla for-loop in regular Python and/or using built-in lists, you are going way slower than you need to. If you told me that this was part of code running under Cython, numba or maybe in pypy, I would feel different. Even if you weren't using any of those, you could most or all of this with broadcasting in numpy with a speedup possibly of 100x +.

Parallel processing in Python is not very easy, whereas the stuff mentioned above is, relatively, straightforward.

I am running a code that takes days. I want to speed it up, and I thought maybe parallel processing is the way to go, but wasn't certain. How broadcasting will speed up the code 100x? Could you elaborate more?

StoneTemplePython
Gold Member
I am running a code that takes days. I want to speed it up, and I thought maybe parallel processing is the way to go, but wasn't certain. How broadcasting will speed up the code 100x? Could you elaborate more?
That was my read on this.

For numerics, I really like using Numpy. Broadcasting or more generally "vectorizing" code pushes it through numpy and gives you, in essence, C speeds.

Vectorization techniques are pretty standard (and old) for interpreted languages. The basic idea is to minimize the involvement of the interpreter. You may want to watch the below to get started:

- - - -
The alternatives are:

Numba plays well with Numpy but not much else, and takes some skill to wield -- I like to augment with Numba though doing so makes the whole vectorization technique unnecessary.

Cython is one step in between C and Python -- fast but uglier and clunky.

These days, where possible, I'd skip all this and just do the code in Julia (whose syntax will remind you of matlab and python).

Last edited:
jedishrfu
Mentor
I think he’s referring to a distributed computing scheme where your tasks are distributed across several computers and then aggregated together when complete. Apache Spark as an example, can do this sort of thing.

Yes, Julia is a great way to go as it has packages that support distributed computing and works well with python too. See the site

https://lectures.quantecon.org/jl/index.html

for more on this stuff.

@StoneTemplePython Thanks. I will check the video out, but for now, how can I use vectorization in my for loop example to speed it up?

@jedishrfu Yes, I want to distribute the process, but not on multiple computers as I don't have the resources, but rather on multiple threads (I think the term is multithreading or multiproceses in other languages like C++). I don't have to wait the result for iteration k, to start iteration k+1. I can do all of them in parallel, and then combine the results in out list.

I could run multiple Jupyter notebooks (or any other editors), each with one point/value from the for loop, and then combine them manually at the end, but I don't think this the most efficient way.

Another question: if I implemented the for loop in C++ as it is in Python, would the process be (noticeably) faster?

StoneTemplePython
Gold Member
@StoneTemplePython Thanks. I will check the video out, but for now, how can I use vectorization in my for loop example to speed it up?
I'm not sure what else to tell you here -- it's a technique that takes some work to get good at. The nice thing about python is there are tons of fully worked examples in blogs and stackedoverflow.

Matlab users historically had to get good at this technique too -- I'm a touch surprised you're not familiar with it. In very recent years, I think, Matlab has added some features to over-ride slow interpreter driven performance (i.e. to compete with the likes of Julia) -- specifically the term I'm thinking of it Just-in-Time Compiled

Another question: if I implemented the for loop in C++ as it is in Python, would the process be (noticeably) faster?
very much

I'm not sure what else to tell you here -- it's a technique that takes some work to get good at. The nice thing about python is there are tons of fully worked examples in blogs and stackedoverflow.

Matlab users historically had to get good at this technique too -- I'm a touch surprised you're not familiar with it. In very recent years, I think, Matlab has added some features to over-ride slow interpreter driven performance (i.e. to compete with the likes of Julia).

very much

I am familiar with vectorization in built-in functions in MATLAB like sum(1:10), but I haven't implemented a user-defined function that processes a vector (I don't know the underlying processes that makes sum(1:10) faster than a successive addition of the numbers between 1 and 10 in a for loop). I have to look this up. I may also consider switching to C++ if it is much faster than Python. Speed in important in my case. Thanks

StoneTemplePython
Gold Member
I am familiar with vectorization in built-in functions in MATLAB like sum(1:10), but I haven't implemented a user-defined function that processes a vector (I don't know the underlying processes that makes sum(1:10) faster than a successive addition of the numbers between 1 and 10 in a for loop). I have to look this up. I may also consider switching to C++ if it is much faster than Python. Speed in important in my case. Thanks

If you're ok with the tedium, doing it directly in C++ should blow (most) everything else out of the water with respect to minimizing run-times...

Klystron and jedishrfu
If you're ok with the tedium, doing it directly in C++ should blow (most) everything else out of the water with respect to minimizing run-times...

I will give it a try.

jedishrfu
jedishrfu
Mentor

The idea is to keep the CPU running while other hardware handles things in the background.

A simple example is to imagine yourself as a chess master/CPU playing several games/threads at the same time. Once you make a move/IO you switch to the next game/thread while your opponent ponders what to do next. In that way you are thinking constantly high CPU usage and can juggle games efficiently.

Mark44
Mentor
I'm not sure this is necessarily true in this era of processors with multiple cores. If you can break up the operations into separate threads, each running on its own core, you can greatly speed up the calculations. Modern C++ (i.e., C++11 or C++14) has both low-level and high-level thread operations for parallelizing a program.

Klystron and jedishrfu
Unbelievable. I ran the same code in Python and C++. The C++ code took around 8 seconds to finish, while the same code in Python took more than 240 seconds!! This is a speed factor of 30!! Still C++ takes time for the hard cases, but this is a good sign that it will finish faster. I am sure I can speed it up more with more efficient programming. I will investigate this if the hard cases still take days.

StoneTemplePython
Gold Member
Unbelievable. I ran the same code in Python and C++. The C++ code took around 8 seconds to finish, while the same code in Python took more than 240 seconds!! This is a speed factor of 30!! Still C++ takes time for the hard cases, but this is a good sign that it will finish faster. I am sure I can speed it up more with more efficient programming. I will investigate this if the hard cases still take days.

Don't be too surprised -- if you want speed in Python, you need to use some of the tricks / techniques I mentioned earlier or push everything down to built-in functions from libraries -- almost all of those are actually coded in C or C++ (for very good reason)

Based on some simulations and number crunching I've done, I think a ##\gt 100x## multiple is not uncommon at all

That 'lose your loops' video will give you a flavor of why this happens

jedishrfu
I have never found C++ difficult to learn (I haven't practiced it regularly though, so I had to search some syntax, which was quick), but the de facto program in our field is MATLAB. I think MATLAB and Python are better than C++ when it comes to visualization, which is not an issue for me, as I can copy and paste the results from the C++ program to a MATLAB file, and plot it there.

jedishrfu
Mentor
I'm not sure this is necessarily true in this era of processors with multiple cores. If you can break up the operations into separate threads, each running on its own core, you can greatly speed up the calculations. Modern C++ (i.e., C++11 or C++14) has both low-level and high-level thread operations for parallelizing a program.
Yes you are correct. Multi-core changes the game somewhat as a mashup of distributed and multi-threaded with shared memory access and no network bandwidth issues.

Don't be too surprised -- if you want speed in Python, you need to use some of the tricks / techniques I mentioned earlier or push everything down to built-in functions from libraries -- almost all of those are actually coded in C or C++ (for very good reason)

Based on some simulations and number crunching I've done, I think a ##\gt 100x## multiple is not uncommon at all

That 'lose your loops' video will give you a flavor of why this happens

I am trying to evaluate a numerical integration, and I need to make the step size very small to get accurate evaluation. With a step size of 0.01 there was a huge difference in theory, but you don't feel it is significant because Python took about 4 minutes to finish. But when I changed the step size to 0.001, the C++ program finished in 4 minutes, while the Python program is still running after more than 45 minutes. If the 30x holds true, then it is expected for the Python program to finish in 2 hours (total)!! I am definitely switching to C++. I am glad I opened this thread to come to this realization.

The idea is to keep the CPU running while other hardware handles things in the background.

A simple example is to imagine yourself as a chess master/CPU playing several games/threads at the same time. Once you make a move/IO you switch to the next game/thread while your opponent ponders what to do next. In that way you are thinking constantly high CPU usage and can juggle games efficiently.

I am not sure if I get this, but the idea is that since you have to evaluate the same function on different values the don't depend on each others, why cannot you do it in parallel? Multicore CPUs was in my mind when I asked the question, because if you have one CPU, and you rotate in round-robin fashion over different processes by executing a part of each process each time instead of finishing a whole process and then move to the next, the end result in terms of time would be the same.

jedishrfu
Mentor
I am not sure if I get this, but the idea is that since you have to evaluate the same function on different values the don't depend on each others, why cannot you do it in parallel? Multicore CPUs was in my mind when I asked the question, because if you have one CPU, and you rotate in round-robin fashion over different processes by executing a part of each process each time instead of finishing a whole process and then move to the next, the end result in terms of time would be the same.

With a single cpu there is no parallellism, there is instead thread switching. IO operations are a convenient time to do a thread switch or a task switch since the thread can’t continue until the data is read or written. When it’s just computation then the cpu decides how much time to give a thread before it switches to the next thread.

A related idea is shared data where the threads block if the executing thread is updating the shared item.

Multicores try to speed the threading strategy up by minimizing the thread switching overhead and the shared memory access blocking as well as true parallel operation on each core.

Here’s a better description of how cores work, using cached memory and other shared resources

https://en.m.wikipedia.org/wiki/Multi-core_processor

Last edited:
EngWiPy
Staff Emeritus
2021 Award
With a single cpu there is no parallellism

This is the key. The corollary to this is that if you have a 4 core CPU, a single-threaded program will use only 25% of the available CPU, so there is a potential factor of 4 gain, but only a factor of 4.

If it were me, I'd see what C/C++ gains I could get, what algorithmic gains I could get, and then when I was done, go after the factor of 4 with OpenMP.

jedishrfu
I am familiar with vectorization in built-in functions in MATLAB like sum(1:10), but I haven't implemented a user-defined function that processes a vector (I don't know the underlying processes that makes sum(1:10) faster than a successive addition of the numbers between 1 and 10 in a for loop). I have to look this up. I may also consider switching to C++ if it is much faster than Python. Speed in important in my case. Thanks
What matters most is the overhead of python. python lists can contain different types, and the interpreter has to check every time a list element is accessed. The list elements will also be pointers to memory.
A numpy array has only one pointer to all of the data, and all elements of the array are the same type. There is still a lot of overhead if you use small arrays.

I did a few speed tests.
Time is in nanoseconds/iteration
Code:
array
size    1000    10000    100000    1000000

python    70       70         70        71
numpy     4       0.9         0.7       1.3
c        0.45      0.46       0.53      0.7

Python is a single for loop
Code:
sum = 0
for (i in range len(a)):
sum += a[i]
return sum

This takes 70 ns for every iteration, indepedent of how big the array is. this is about 160 clockcycles of my laptop, and the processor can do up to 4 instructions in a clockcycle.

numpy just calls numpy.sum() on a numpy array. It will run 50 to 70 times faster for large arrays. Smaller arrays stil have a lot of python-overhead. Even for an array of size 1000 the python overhead will take more time than doing the work. For really large arrays the cost of memory reads wich no longer fit in the caches will become important.

The c program needs some tricks to go at maximum speed.
The floating point adder of intel processers (since at least the pentium) can start a new add every clockcycle, but there is a delay of 4 cycles until the output becomes available. if you use the same loop as in python there will be a 4 clockcycle delay for every iteration, because every iteration depends on the previous iteration. You can however collect the sum in 4 different variables, and the loop will run 4 times as fast. 0.45 ns/iteration is nearly 1 clockcycle/iteration on my 2.3 GHz processor.
Code:
    for (int cnt = 0; cnt < REP; cnt++) {
double sum1 = 0;
double sum2 = 0;
double sum3 = 0;
double sum4 = 0;

for (int i = 0; i < SZ; i += 4) {
sum1 += a[i];
sum2 += a[i + 1];
sum3 += a[i + 2];
sum4 += a[i + 3];
}
double sum = sum1 + sum2 + sum3 + sum4;
}

StoneTemplePython and jedishrfu
Thanks for the technical details. Python is easy to use, and more user-friendly, but at the cost of more overhead. Are you saying that numpy in Python can compete with C++ speed-wise? Not sure if C++ supports vectorization or something similar.

I can see how dividing the summation can make a difference in the C++ code, but what if inside the for loop you call a function whose result doesn't add to the result of other iterations? How to speed up this for loop? Something like this

Code:
int vec[4] = {0.2, 0.21, 0.3, 0.5}
for (int i =0; i < 4; i++){
cout << func(vec[i]) << endl;
}

float func(float u){
return //some operations
}

Last edited:
Thanks for the technical details. Python is easy to use, and more user-friendly, but at the cost of more overhead. Are you saying that numpy in Python can compete with C++ speed-wise? Not sure if C++ supports vectorization or something similar.
C++ compilers can auto-vectorize loops, and sometimes use SSE intstructions to do more than one addition at a time with one instruction.
You'll need to avoid data dependencies like the usage of a single variable for sum, pointers, and you will need to make func an inline function. Much will depend on the compiler and options.
It also will be very hard work. Using numpy is much easier.

I can see how dividing the summation can make a difference in the C++ code, but what if inside the for loop you call a function whose result doesn't add to the result of other iterations? How to speed up this for loop? Something like this
Code:
int vec[4] = {0.2, 0.21, 0.3, 0.5}
for (int i =0; i < 4; i++){
cout << func(vec[i]) << endl;
It depends on what that function is. Divisions, square roots, or hard to predict conditional jumps are bad. Table lookups are good. (if it fits in the cache)
If the iterations are completely independent, than the processor can work on more than 1 iteration at the same time, even with only one core.

I think MATLAB and Python are better than C++ when it comes to visualization
Well, it depends on what you mean by 'better'. From C++ you can basically use a lot of libraries that python uses for visualization.

In the worst case scenario, you can write the results into a file and use gnuplot for displaying, but you can also use some library for visualization. I like VTK for that: https://www.vtk.org/

You can do a lot of things with it, from 2D charts (example on my blog: https://compphys.go.ro/empirical-pseudopotential/ ) to volumetric visualization (another example: https://compphys.go.ro/dft-for-a-quantum-dot/ ).

...
If the iterations are completely independent, than the processor can work on more than 1 iteration at the same time, even with only one core.

Well, it depends on what you mean by 'better'. From C++ you can basically use a lot of libraries that python uses for visualization.

In the worst case scenario, you can write the results into a file and use gnuplot for displaying, but you can also use some library for visualization. I like VTK for that: https://www.vtk.org/

You can do a lot of things with it, from 2D charts (example on my blog: https://compphys.go.ro/empirical-pseudopotential/ ) to volumetric visualization (another example: https://compphys.go.ro/dft-for-a-quantum-dot/ ).

By better, I meant you have more flexibility and it is easier to use. Maybe I am wrong, because I am not very familiar with C++ and its full capabilities and libraries.

Mark44
Mentor
By better, I meant you have more flexibility and it is easier to use. Maybe I am wrong, because I am not very familiar with C++ and its full capabilities and libraries.
But as you're finding out, more flexibility and ease of use come at a price -- much slower execution times. C++ doesn't come with a lot of libraries, particularly for graphics and such, but there are third party libraries, such as the ones mentioned by @aaroman, that you can use to obtain images from the data your program creates.

StoneTemplePython
No, just pipelining and execution of more than one instruction at the time. If all the instructions are independent, a single cpu can do many things at a time. If you execute something like
for (i=0; i<1000; i++) A[ i] = A[ i] - B[ i] * c; The cpu can have dozens of instructions waiting for other instructions or memory in the pipeline at the same time. The cpu will have no problems starting a multiply, a subtraction, a load (2 loads for the newest types) and a store in one clock cycle, even if these belong to different iterations of the loop.

Last edited by a moderator:
But as you're finding out, more flexibility and ease of use come at a price -- much slower execution times. C++ doesn't come with a lot of libraries, particularly for graphics and such, but there are third party libraries, such as the ones mentioned by @aaroman, that you can use to obtain images from the data your program creates.

In plotting there is no computational processing usually, and Python and MATLAB do it instantly if you have the vectors you want to plot, at least in my case. I can appreciate that C++ is much faster when it comes to reducing computational time, where it can reduce the execution time from days to hours, or from hours to minutes. In that case it is worth exploring C++ more, because there is a real and significant advantage. I am not saying to C++ cannot do a good job in visualization, but in my case, I don't have to venture to that territory, because I think I don't have to.

No, just pipelining and execution of more than one instruction at the time. If all the instructions are independent, a single cpu can do many things at a time. If you execute something like
for (i=0; i<1000; i++) for (i=0; i<1000; i++) A[ i] = A[ i] - B[ i] * c; The cpu can have dozens of instructions waiting for other instructions or memory in the pipeline at the same time. The cpu will have no problems starting a multiply, a subtraction, a load (2 loads for the newest types) and a store in one clock cycle, even if these belong to different iterations of the loop.

OK, I see. But we have no control over this. I mean this is a hardware design architecture how the CPU executes different instructions, because they are done at different units. But what if you are executing the same function on independent data, but using the same instructions?

Last edited by a moderator:
Mark44
Mentor
The cpu can have dozens of instructions waiting for other instructions or memory in the pipeline at the same time. The cpu will have no problems starting a multiply, a subtraction, a load (2 loads for the newest types) and a store in one clock cycle, even if these belong to different iterations of the loop.
OK, I see. But we have no control over this. I mean this is a hardware design architecture how the CPU executes different instructions, because they are done at different units. But what if you are executing the same function on independent data, but using the same instructions?
We have some control over this, which is what optimizations using loop unrolling or loop unwinding are about.
The processor executes instructions in one or more pipelines, and tries to guess what the next instruction will be. If it guesses correctly, everything is fine, since that instruction is in the pipeline. If it guesses wrong, it has to flush the pipeline, which takes several clock cycles to refill.

Loops such as for or while loops can be problematic, as are branches such as if and if ... else.

Here's a simple example from this wiki article: https://en.wikipedia.org/wiki/Loop_unrolling
C:
int x;
for (x = 0; x < 100; x++)
{
delete(x);
}

The same loop, after unrolling:
C:
int x;
for (x = 0; x < 100; x += 5 )
{
delete(x);
delete(x + 1);
delete(x + 2);
delete(x + 3);
delete(x + 4);
}

Klystron