Convert a FOR loop to parallel processes

EngWiPy · Jul 24, 2018

Hi, I know that Python supports parallel programming. I think it can be used in my case but not sure how. Basically I have this for loop

Python:

out = []
for value in vec:
       out.append(func(value))

where func is some function that takes value as an argument. The different values in vec are independent, and can be used in parallel to calculate func(value). I am relatively new to Python, so, how can I convert this simple for loop to a code that will process the different independent values in parallel to speed up the results?

Thanks in advance

StoneTemplePython · Jul 24, 2018

I think you're putting the cart in front of the horse here. If you are running a vanilla for-loop in regular Python and/or using built-in lists, you are going way slower than you need to. If you told me that this was part of code running under Cython, numba or maybe in pypy, I would feel different. Even if you weren't using any of those, you could most or all of this with broadcasting in numpy with a speedup possibly of 100x +.

Parallel processing in Python is not very easy, whereas the stuff mentioned above is, relatively, straightforward.

EngWiPy · Jul 24, 2018

I am running a code that takes days. I want to speed it up, and I thought maybe parallel processing is the way to go, but wasn't certain. How broadcasting will speed up the code 100x? Could you elaborate more?

StoneTemplePython · Jul 24, 2018

EngWiPy said:

I am running a code that takes days. I want to speed it up, and I thought maybe parallel processing is the way to go, but wasn't certain. How broadcasting will speed up the code 100x? Could you elaborate more?

That was my read on this.

For numerics, I really like using Numpy. Broadcasting or more generally "vectorizing" code pushes it through numpy and gives you, in essence, C speeds.

Vectorization techniques are pretty standard (and old) for interpreted languages. The basic idea is to minimize the involvement of the interpreter. You may want to watch the below to get started:

- - - -
The alternatives are:

Numba plays well with Numpy but not much else, and takes some skill to wield -- I like to augment with Numba though doing so makes the whole vectorization technique unnecessary.

Cython is one step in between C and Python -- fast but uglier and clunky.

These days, where possible, I'd skip all this and just do the code in Julia (whose syntax will remind you of MATLAB and python).

jedishrfu · Jul 24, 2018

I think he’s referring to a distributed computing scheme where your tasks are distributed across several computers and then aggregated together when complete. Apache Spark as an example, can do this sort of thing.

Yes, Julia is a great way to go as it has packages that support distributed computing and works well with python too. See the site

https://lectures.quantecon.org/jl/index.html

for more on this stuff.

EngWiPy · Jul 24, 2018

@StoneTemplePython Thanks. I will check the video out, but for now, how can I use vectorization in my for loop example to speed it up?

@jedishrfu Yes, I want to distribute the process, but not on multiple computers as I don't have the resources, but rather on multiple threads (I think the term is multithreading or multiproceses in other languages like C++). I don't have to wait the result for iteration k, to start iteration k+1. I can do all of them in parallel, and then combine the results in out list.

I could run multiple Jupyter notebooks (or any other editors), each with one point/value from the for loop, and then combine them manually at the end, but I don't think this the most efficient way.

Another question: if I implemented the for loop in C++ as it is in Python, would the process be (noticeably) faster?

StoneTemplePython · Jul 24, 2018

EngWiPy said:

@StoneTemplePython Thanks. I will check the video out, but for now, how can I use vectorization in my for loop example to speed it up?

I'm not sure what else to tell you here -- it's a technique that takes some work to get good at. The nice thing about python is there are tons of fully worked examples in blogs and stackedoverflow.

Matlab users historically had to get good at this technique too -- I'm a touch surprised you're not familiar with it. In very recent years, I think, Matlab has added some features to over-ride slow interpreter driven performance (i.e. to compete with the likes of Julia) -- specifically the term I'm thinking of it Just-in-Time Compiled

EngWiPy said:

Another question: if I implemented the for loop in C++ as it is in Python, would the process be (noticeably) faster?

very much

EngWiPy · Jul 24, 2018

StoneTemplePython said:

I'm not sure what else to tell you here -- it's a technique that takes some work to get good at. The nice thing about python is there are tons of fully worked examples in blogs and stackedoverflow.

Matlab users historically had to get good at this technique too -- I'm a touch surprised you're not familiar with it. In very recent years, I think, Matlab has added some features to over-ride slow interpreter driven performance (i.e. to compete with the likes of Julia).
very much

I am familiar with vectorization in built-in functions in MATLAB like sum(1:10), but I haven't implemented a user-defined function that processes a vector (I don't know the underlying processes that makes sum(1:10) faster than a successive addition of the numbers between 1 and 10 in a for loop). I have to look this up. I may also consider switching to C++ if it is much faster than Python. Speed in important in my case. Thanks

StoneTemplePython · Jul 24, 2018

EngWiPy said:

I am familiar with vectorization in built-in functions in MATLAB like sum(1:10), but I haven't implemented a user-defined function that processes a vector (I don't know the underlying processes that makes sum(1:10) faster than a successive addition of the numbers between 1 and 10 in a for loop). I have to look this up. I may also consider switching to C++ if it is much faster than Python. Speed in important in my case. Thanks

If you're ok with the tedium, doing it directly in C++ should blow (most) everything else out of the water with respect to minimizing run-times...

EngWiPy · Jul 24, 2018

StoneTemplePython said:

If you're ok with the tedium, doing it directly in C++ should blow (most) everything else out of the water with respect to minimizing run-times...

I will give it a try.

jedishrfu · Jul 24, 2018

Multi threaded approach only works when each thread has IO to do otherwise it would be just switching from thread to thread.

The idea is to keep the CPU running while other hardware handles things in the background.

A simple example is to imagine yourself as a chess master/CPU playing several games/threads at the same time. Once you make a move/IO you switch to the next game/thread while your opponent ponders what to do next. In that way you are thinking constantly high CPU usage and can juggle games efficiently.

Mark44 · Jul 24, 2018

jedishrfu said:

Multi threaded approach only works when each thread has IO to do otherwise it would be just switching from thread to thread.

I'm not sure this is necessarily true in this era of processors with multiple cores. If you can break up the operations into separate threads, each running on its own core, you can greatly speed up the calculations. Modern C++ (i.e., C++11 or C++14) has both low-level and high-level thread operations for parallelizing a program.

EngWiPy · Jul 24, 2018

Unbelievable. I ran the same code in Python and C++. The C++ code took around 8 seconds to finish, while the same code in Python took more than 240 seconds! This is a speed factor of 30! Still C++ takes time for the hard cases, but this is a good sign that it will finish faster. I am sure I can speed it up more with more efficient programming. I will investigate this if the hard cases still take days.

StoneTemplePython · Jul 24, 2018

EngWiPy said:

Unbelievable. I ran the same code in Python and C++. The C++ code took around 8 seconds to finish, while the same code in Python took more than 240 seconds! This is a speed factor of 30! Still C++ takes time for the hard cases, but this is a good sign that it will finish faster. I am sure I can speed it up more with more efficient programming. I will investigate this if the hard cases still take days.

Don't be too surprised -- if you want speed in Python, you need to use some of the tricks / techniques I mentioned earlier or push everything down to built-in functions from libraries -- almost all of those are actually coded in C or C++ (for very good reason)

Based on some simulations and number crunching I've done, I think a ##\gt 100x## multiple is not uncommon at all

That 'lose your loops' video will give you a flavor of why this happens

EngWiPy · Jul 24, 2018

I have never found C++ difficult to learn (I haven't practiced it regularly though, so I had to search some syntax, which was quick), but the de facto program in our field is MATLAB. I think MATLAB and Python are better than C++ when it comes to visualization, which is not an issue for me, as I can copy and paste the results from the C++ program to a MATLAB file, and plot it there.

jedishrfu · Jul 24, 2018

Mark44 said:

I'm not sure this is necessarily true in this era of processors with multiple cores. If you can break up the operations into separate threads, each running on its own core, you can greatly speed up the calculations. Modern C++ (i.e., C++11 or C++14) has both low-level and high-level thread operations for parallelizing a program.

Yes you are correct. Multi-core changes the game somewhat as a mashup of distributed and multi-threaded with shared memory access and no network bandwidth issues.

EngWiPy · Jul 24, 2018

StoneTemplePython said:

Don't be too surprised -- if you want speed in Python, you need to use some of the tricks / techniques I mentioned earlier or push everything down to built-in functions from libraries -- almost all of those are actually coded in C or C++ (for very good reason)

Based on some simulations and number crunching I've done, I think a ##\gt 100x## multiple is not uncommon at all

That 'lose your loops' video will give you a flavor of why this happens

I am trying to evaluate a numerical integration, and I need to make the step size very small to get accurate evaluation. With a step size of 0.01 there was a huge difference in theory, but you don't feel it is significant because Python took about 4 minutes to finish. But when I changed the step size to 0.001, the C++ program finished in 4 minutes, while the Python program is still running after more than 45 minutes. If the 30x holds true, then it is expected for the Python program to finish in 2 hours (total)! I am definitely switching to C++. I am glad I opened this thread to come to this realization.

EngWiPy · Jul 24, 2018

jedishrfu said:

Multi threaded approach only works when each thread has IO to do otherwise it would be just switching from thread to thread.

The idea is to keep the CPU running while other hardware handles things in the background.

A simple example is to imagine yourself as a chess master/CPU playing several games/threads at the same time. Once you make a move/IO you switch to the next game/thread while your opponent ponders what to do next. In that way you are thinking constantly high CPU usage and can juggle games efficiently.

I am not sure if I get this, but the idea is that since you have to evaluate the same function on different values the don't depend on each others, why cannot you do it in parallel? Multicore CPUs was in my mind when I asked the question, because if you have one CPU, and you rotate in round-robin fashion over different processes by executing a part of each process each time instead of finishing a whole process and then move to the next, the end result in terms of time would be the same.

jedishrfu · Jul 24, 2018

EngWiPy said:

I am not sure if I get this, but the idea is that since you have to evaluate the same function on different values the don't depend on each others, why cannot you do it in parallel? Multicore CPUs was in my mind when I asked the question, because if you have one CPU, and you rotate in round-robin fashion over different processes by executing a part of each process each time instead of finishing a whole process and then move to the next, the end result in terms of time would be the same.

With a single cpu there is no parallellism, there is instead thread switching. IO operations are a convenient time to do a thread switch or a task switch since the thread can’t continue until the data is read or written. When it’s just computation then the cpu decides how much time to give a thread before it switches to the next thread.

A related idea is shared data where the threads block if the executing thread is updating the shared item.

Multicores try to speed the threading strategy up by minimizing the thread switching overhead and the shared memory access blocking as well as true parallel operation on each core.

Here’s a better description of how cores work, using cached memory and other shared resources

https://en.m.wikipedia.org/wiki/Multi-core_processor

Vanadium 50 · Jul 25, 2018

jedishrfu said:

With a single cpu there is no parallellism

This is the key. The corollary to this is that if you have a 4 core CPU, a single-threaded program will use only 25% of the available CPU, so there is a potential factor of 4 gain, but only a factor of 4.

If it were me, I'd see what C/C++ gains I could get, what algorithmic gains I could get, and then when I was done, go after the factor of 4 with OpenMP.

willem2 · Jul 25, 2018

EngWiPy said:

I am familiar with vectorization in built-in functions in MATLAB like sum(1:10), but I haven't implemented a user-defined function that processes a vector (I don't know the underlying processes that makes sum(1:10) faster than a successive addition of the numbers between 1 and 10 in a for loop). I have to look this up. I may also consider switching to C++ if it is much faster than Python. Speed in important in my case. Thanks

What matters most is the overhead of python. python lists can contain different types, and the interpreter has to check every time a list element is accessed. The list elements will also be pointers to memory.
A numpy array has only one pointer to all of the data, and all elements of the array are the same type. There is still a lot of overhead if you use small arrays.

I did a few speed tests.
Time is in nanoseconds/iteration

Code:

array
size    1000    10000    100000    1000000

python    70       70         70        71
numpy     4       0.9         0.7       1.3
c        0.45      0.46       0.53      0.7

Python is a single for loop

Code:

sum = 0
for (i in range len(a)):
    sum += a[i]
return sum

This takes 70 ns for every iteration, indepedent of how big the array is. this is about 160 clockcycles of my laptop, and the processor can do up to 4 instructions in a clockcycle.

numpy just calls numpy.sum() on a numpy array. It will run 50 to 70 times faster for large arrays. Smaller arrays stil have a lot of python-overhead. Even for an array of size 1000 the python overhead will take more time than doing the work. For really large arrays the cost of memory reads which no longer fit in the caches will become important.

The c program needs some tricks to go at maximum speed.
The floating point adder of intel processers (since at least the pentium) can start a new add every clockcycle, but there is a delay of 4 cycles until the output becomes available. if you use the same loop as in python there will be a 4 clockcycle delay for every iteration, because every iteration depends on the previous iteration. You can however collect the sum in 4 different variables, and the loop will run 4 times as fast. 0.45 ns/iteration is nearly 1 clockcycle/iteration on my 2.3 GHz processor.

Code:

    for (int cnt = 0; cnt < REP; cnt++) {
    double sum1 = 0;
    double sum2 = 0;
    double sum3 = 0;
    double sum4 = 0;

        for (int i = 0; i < SZ; i += 4) {
            sum1 += a[i];
            sum2 += a[i + 1];
            sum3 += a[i + 2];
            sum4 += a[i + 3];
        }
        double sum = sum1 + sum2 + sum3 + sum4;
    }

EngWiPy · Jul 25, 2018

Thanks for the technical details. Python is easy to use, and more user-friendly, but at the cost of more overhead. Are you saying that numpy in Python can compete with C++ speed-wise? Not sure if C++ supports vectorization or something similar.

I can see how dividing the summation can make a difference in the C++ code, but what if inside the for loop you call a function whose result doesn't add to the result of other iterations? How to speed up this for loop? Something like this

Code:

int vec[4] = {0.2, 0.21, 0.3, 0.5}
for (int i =0; i < 4; i++){
count << func(vec[i]) << endl;
}

float func(float u){
return //some operations
}

willem2 · Jul 25, 2018

EngWiPy said:

Thanks for the technical details. Python is easy to use, and more user-friendly, but at the cost of more overhead. Are you saying that numpy in Python can compete with C++ speed-wise? Not sure if C++ supports vectorization or something similar.

C++ compilers can auto-vectorize loops, and sometimes use SSE intstructions to do more than one addition at a time with one instruction.
You'll need to avoid data dependencies like the usage of a single variable for sum, pointers, and you will need to make func an inline function. Much will depend on the compiler and options.
It also will be very hard work. Using numpy is much easier.

I can see how dividing the summation can make a difference in the C++ code, but what if inside the for loop you call a function whose result doesn't add to the result of other iterations? How to speed up this for loop? Something like this

Code:

int vec[4] = {0.2, 0.21, 0.3, 0.5}
for (int i =0; i < 4; i++){
cout << func(vec[i]) << endl;

It depends on what that function is. Divisions, square roots, or hard to predict conditional jumps are bad. Table lookups are good. (if it fits in the cache)
If the iterations are completely independent, than the processor can work on more than 1 iteration at the same time, even with only one core.

aaroman · Jul 26, 2018

I think MATLAB and Python are better than C++ when it comes to visualization

Well, it depends on what you mean by 'better'. From C++ you can basically use a lot of libraries that python uses for visualization.

In the worst case scenario, you can write the results into a file and use gnuplot for displaying, but you can also use some library for visualization. I like VTK for that: https://www.vtk.org/

You can do a lot of things with it, from 2D charts (example on my blog: https://compphys.go.ro/empirical-pseudopotential/ ) to volumetric visualization (another example: https://compphys.go.ro/dft-for-a-quantum-dot/ ).

EngWiPy · Jul 26, 2018

willem2 said:

...
If the iterations are completely independent, than the processor can work on more than 1 iteration at the same time, even with only one core.

Are you talking about multithreading?

EngWiPy · Jul 26, 2018

aaroman said:

Well, it depends on what you mean by 'better'. From C++ you can basically use a lot of libraries that python uses for visualization.

In the worst case scenario, you can write the results into a file and use gnuplot for displaying, but you can also use some library for visualization. I like VTK for that: https://www.vtk.org/

You can do a lot of things with it, from 2D charts (example on my blog: https://compphys.go.ro/empirical-pseudopotential/ ) to volumetric visualization (another example: https://compphys.go.ro/dft-for-a-quantum-dot/ ).

By better, I meant you have more flexibility and it is easier to use. Maybe I am wrong, because I am not very familiar with C++ and its full capabilities and libraries.

Mark44 · Jul 26, 2018

EngWiPy said:

By better, I meant you have more flexibility and it is easier to use. Maybe I am wrong, because I am not very familiar with C++ and its full capabilities and libraries.

But as you're finding out, more flexibility and ease of use come at a price -- much slower execution times. C++ doesn't come with a lot of libraries, particularly for graphics and such, but there are third party libraries, such as the ones mentioned by @aaroman, that you can use to obtain images from the data your program creates.

willem2 · Jul 26, 2018

EngWiPy said:

Are you talking about multithreading?

No, just pipelining and execution of more than one instruction at the time. If all the instructions are independent, a single cpu can do many things at a time. If you execute something like
for (i=0; i<1000; i++) A[ i] = A[ i] - B[ i] * c; The cpu can have dozens of instructions waiting for other instructions or memory in the pipeline at the same time. The cpu will have no problems starting a multiply, a subtraction, a load (2 loads for the newest types) and a store in one clock cycle, even if these belong to different iterations of the loop.

EngWiPy · Jul 26, 2018

Mark44 said:

But as you're finding out, more flexibility and ease of use come at a price -- much slower execution times. C++ doesn't come with a lot of libraries, particularly for graphics and such, but there are third party libraries, such as the ones mentioned by @aaroman, that you can use to obtain images from the data your program creates.

In plotting there is no computational processing usually, and Python and MATLAB do it instantly if you have the vectors you want to plot, at least in my case. I can appreciate that C++ is much faster when it comes to reducing computational time, where it can reduce the execution time from days to hours, or from hours to minutes. In that case it is worth exploring C++ more, because there is a real and significant advantage. I am not saying to C++ cannot do a good job in visualization, but in my case, I don't have to venture to that territory, because I think I don't have to.

EngWiPy · Jul 26, 2018

willem2 said:

No, just pipelining and execution of more than one instruction at the time. If all the instructions are independent, a single cpu can do many things at a time. If you execute something like
for (i=0; i<1000; i++) for (i=0; i<1000; i++) A[ i] = A[ i] - B[ i] * c; The cpu can have dozens of instructions waiting for other instructions or memory in the pipeline at the same time. The cpu will have no problems starting a multiply, a subtraction, a load (2 loads for the newest types) and a store in one clock cycle, even if these belong to different iterations of the loop.

OK, I see. But we have no control over this. I mean this is a hardware design architecture how the CPU executes different instructions, because they are done at different units. But what if you are executing the same function on independent data, but using the same instructions?

Convert a FOR loop to parallel processes

Discussion Overview

Discussion Character

Main Points Raised

Areas of Agreement / Disagreement

Contextual Notes

Who May Find This Useful

Similar threads

Sweetspot of data compression

Other than just FizzBuzz to test programmer candidates

How to show RS(U+TRS)* is equivalent to (R+SUT)SU?

HTML/CSS Problems with DNS records

PHP My website presents the visitor with the choice of opting out of using cookies....

Insights Revisiting the Velocity-Time Function

Insights Remote Operated Gate Control System

Insights AI Enriched Problem Solving

Insights Thinking Outside The Box Versus Knowing What’s In The Box

Insights Why Entangled Photon-Polarization Qubits Violate Bell’s Inequality

Insights Quantum Entanglement is a Kinematic Fact, not a Dynamical Effect