Can execution time of this C code be reduced?

In summary, the conversation discusses optimizing code for a finite-difference problem in differential equations. The code sweeps a matrix and evaluates a specific stencil for each element, wrapping around at the edges of the matrix. The respondents suggest various ways to optimize the code, such as combining if statements, using the modulus operator, and caching the wrapped indices in an array. They also note that a timer can be used to measure execution time and that hoisting unnecessary floating-point multiplications out of the loop improves the code's efficiency.
  • #1
jackmell
Hi. I've been working on a finite-difference problem in differential equations. I was wondering if this code can be further optimized, perhaps by eliminating some if statements or assignment statements. That is, what is the most efficient way to code this, or is it already written in the most efficient way? It goes through the matrix u(0..M-1, 0..N-1) and for each u(i,j) computes:

u(i,j+1)-2u(i,j)+u(i,j-1)+u(i+1,j)-2u(i,j)+u(i-1,j)

except that when it gets to the end, it wraps around to the start and likewise when it's at the start, it wraps around to the end so that:

if i=0, then i-1 goes to M-1
if i=M-1, then i+1 goes to 0

if j=0, then j-1 goes to N-1
if j=N-1, then j+1 goes to 0

Here's the code taken from:

http://www.teemuleppanen.net/content/science/turing/turing_learning.shtml

Code:
for (i = 0; i < M; i++)
    for (j = 0; j < N; j++) {
        lap[i][j] = 0;
        if (j != N-1)
            lap[i][j] += u[i][j+1] - u[i][j];
        if (j != 0)
            lap[i][j] += u[i][j-1] - u[i][j];
        if (i != 0)
            lap[i][j] += u[i-1][j] - u[i][j];
        if (i != M-1)
            lap[i][j] += u[i+1][j] - u[i][j];
        if (j == N-1)
            lap[i][j] += u[i][0] - u[i][j];
        if (j == 0)
            lap[i][j] += u[i][N-1] - u[i][j];
        if (i == M-1)
            lap[i][j] += u[0][j] - u[i][j];
        if (i == 0)
            lap[i][j] += u[M-1][j] - u[i][j];
        lap[i][j] = lap[i][j] / (dx*dx);
    }
 
  • #2
Borek
Code that is not executed takes the least time. At first sight a tiny bit of optimization can be possible. Think if

Code:
if (i != 0) lap[i][j] += u[i-1][j]-u[i][j];
   else if (i == 0) lap[i][j] += u[M-1][j]-u[i][j];

is not faster.

What processor? A few generations back, on the Pentium, it was better to write conditions that were more often true.

But that's just at first sight and from looking at the code only, could be there is more to be gained in the algorithm.
 
  • #3
jackmell
Borek said:
Code that is not executed takes the least time. At first sight a tiny bit of optimization can be possible. Think if

Code:
if (i != 0) lap[i][j] += u[i-1][j]-u[i][j];
   else if (i == 0) lap[i][j] += u[M-1][j]-u[i][j];

is not faster.

What processor? A few generations back, on the Pentium, it was better to write conditions that were more often true.

But that's just at first sight and from looking at the code only, could be there is more to be gained in the algorithm.

I don't think you need that other if in there huh?

Is there any way to code this without any ifs at all? I mean, can we create a set of pointers which automatically "wrap" around and just cycle through the pointers without having to check for the end of the array? For example, create a list:

myxpt=M-1,0,1,2,3,...M-2,M-1,0

and then just index into u like:

u[myxpt[i]][myypt[j]]

Not sure though and maybe the extra indexing would take just as much time as the if statements.
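
Something like this is what I have in mind. Just an untested sketch, and the index-array names are made up; it assumes M, N, u, lap and dx are declared elsewhere (C99 for the variable-length arrays):

Code:
/* Untested sketch: precompute the wrapped neighbor indices once,
   so the inner loop needs no ifs at all. */
int iprev[M], inext[M], jprev[N], jnext[N];

for (i = 0; i < M; i++) {
    iprev[i] = (i == 0)   ? M-1 : i-1;   /* wrapped i-1 */
    inext[i] = (i == M-1) ? 0   : i+1;   /* wrapped i+1 */
}
for (j = 0; j < N; j++) {
    jprev[j] = (j == 0)   ? N-1 : j-1;
    jnext[j] = (j == N-1) ? 0   : j+1;
}

for (i = 0; i < M; i++)
    for (j = 0; j < N; j++)
        lap[i][j] = (u[i][jnext[j]] + u[i][jprev[j]]
                   + u[inext[i]][j] + u[iprev[i]][j]
                   - 4*u[i][j]) / (dx*dx);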

Also, this code is being run many thousands of times and the u and v arrays could be large so even small improvements could make a nice difference in execution time.

I'm not against assembly, but as I recall, well-optimized C usually runs about as fast as hand-written assembly.
 
  • #4
D H
jackmell said:
Code:
for (i = 0; i < M; i++)
    for (j = 0; j < N; j++) {
        lap[i][j] = 0;
        if (j != N-1)
            lap[i][j] += u[i][j+1] - u[i][j];
        if (j != 0)
            lap[i][j] += u[i][j-1] - u[i][j];
        if (i != 0)
            lap[i][j] += u[i-1][j] - u[i][j];
        if (i != M-1)
            lap[i][j] += u[i+1][j] - u[i][j];
        if (j == N-1)
            lap[i][j] += u[i][0] - u[i][j];
        if (j == 0)
            lap[i][j] += u[i][N-1] - u[i][j];
        if (i == M-1)
            lap[i][j] += u[0][j] - u[i][j];
        if (i == 0)
            lap[i][j] += u[M-1][j] - u[i][j];
        lap[i][j] = lap[i][j] / (dx*dx);
    }

Yech.

First, a purely stylistic comment: Always use braces with loops and conditional statements. There is no difference in performance between

Code:
if (condition)
   do_something();

versus

Code:
if (condition) {
   do_something();
}

But enough with the soapboxing. So what are the other reasons for that "yech"?
  • You are making every decision twice. Your code will execute faster if you, for example, combine the if (j != 0) with the if (j == 0) code. Use the else statement.
  • You are making a lot of unnecessary decisions with respect to the outer loop variable i. Make those decisions outside the inner loop, capturing the results in some local variable.
  • That convoluted code is hiding what you are really doing here, which is a two-dimensional finite difference with wrapping at the edges of the array.

Putting this together, try


Code:
double dxsq = dx * dx;
for (i = 0; i < M; i++) {
   int ip = i > 0   ? i-1 : M-1;
   int in = i < M-1 ? i+1 : 0;
   for (j = 0; j < N; j++) {
      int jp = j > 0   ? j-1 : N-1;
      int jn = j < N-1 ? j+1 : 0;
      lap[i][j] = ((u[i][jp] - u[i][j]) +
                   (u[i][jn] - u[i][j]) +
                   (u[ip][j] - u[i][j]) +
                   (u[in][j] - u[i][j])) / dxsq;
   }
}

Now with the adjacency indices separated from the difference equation you can play around with different mechanisms for computing the same end result:
  • Which is faster: Branching to compute the adjacent indices, using the modulus operator, or caching those indices in an array? (An untested modulus version is sketched just after this list.)
  • Which is faster: The difference calculation as done above or something else such as
    lap[i][j] = ((u[i][jp]+u[i][jn]+u[ip][j]+u[in][j])-4*u[i][j]) / dxsq;?
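
For reference, an untested sketch of the modulus variant, keeping the same difference grouping as above so that only the index computation differs:

Code:
/* Untested sketch of the modulus variant.  (i + M - 1) % M is used
   rather than (i - 1) % M because in C the remainder of a negative
   dividend is (typically) negative, e.g. (0 - 1) % M == -1, which
   would index out of bounds. */
for (i = 0; i < M; i++) {
   int ip = (i + M - 1) % M;   /* wrapped i-1 */
   int in = (i + 1) % M;       /* wrapped i+1 */
   for (j = 0; j < N; j++) {
      int jp = (j + N - 1) % N;
      int jn = (j + 1) % N;
      lap[i][j] = ((u[i][jp] - u[i][j]) +
                   (u[i][jn] - u[i][j]) +
                   (u[ip][j] - u[i][j]) +
                   (u[in][j] - u[i][j])) / dxsq;
   }
}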
 
  • #5
rcgldr
You could move the end point cases (i = 0, i = M-1, j = 0, j = N-1) outside the loops, and then eliminate all the conditionals in loops that run from i = 1 to i < M-1 and j = 1 to j < N-1.
 
  • #6
jackmell
Ok, thanks guys. I think caching the indexes in an array would be faster but not sure. I'll test your code.

Could you guys please tell me what I need to wrap around that code to compute the execution time? I assume it's a timer but I don't remember the syntax.

Thanks guys!

Also, I just noticed something that DH pointed out: dx*dx is a floating-point multiplication that's being executed for every element of the array when only one multiplication is really necessary. That's a big improvement already; in fact this code sits inside another loop that's executed, say, 50000 times, so I could move that computation outside the loops and do it only once.
 
  • #7
AlephZero
I'm not sure I would go quite as far as DH did, but you certainly want to take all the "boundary cases" and their if statements out of your main loop. Personally I would put them in two separate loops (one for the row boundaries, the other for the column boundaries).

In the DH-style statement that calculates lap[i][j], get rid of as many parentheses as you can. In standard C/C++, they force the compiler to evaluate the expression in the particular order defined by the parentheses, and (at least at first sight) there seems to be no reason why that is essential.

You can then save some typing by writing -4*u[i][j] instead of 4 separate subtractions of u[i][j], of course.

A different approach to all of this (sketched in the code below) is:

* Make your arrays two elements longer in each direction,
* After each iteration, copy the "wrapped round" data into the extra elements (in a separate loop).
* The main loop then needs no logic and is simple for the compiler to optimize. You might as well leave the subscripts as i+1, i-1, etc, to give the compiler an easy optimization task.
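
A rough, untested sketch of that padded ("ghost cell") layout, assuming u and lap are declared with two extra rows and columns (e.g. u[M+2][N+2]), the real data lives in u[1..M][1..N], and dxsq = dx*dx as in the earlier post:

Code:
/* Untested sketch.  Before each Laplacian pass, copy the wrapped
   values into the extra "ghost" rows and columns; the main loop then
   needs no wrapping logic at all. */
for (j = 1; j <= N; j++) {          /* ghost rows: top/bottom wrap    */
    u[0][j]   = u[M][j];
    u[M+1][j] = u[1][j];
}
for (i = 1; i <= M; i++) {          /* ghost columns: left/right wrap */
    u[i][0]   = u[i][N];
    u[i][N+1] = u[i][1];
}

for (i = 1; i <= M; i++)
    for (j = 1; j <= N; j++)
        lap[i][j] = (u[i][j+1] + u[i][j-1] + u[i+1][j] + u[i-1][j]
                     - 4*u[i][j]) / dxsq;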
 
  • #8
jackmell
Ok guys. Those are good suggestions. I figured out how to time the loop. I need to include <time.h> in my code. Looks like I can do start = clock() at the start of the loop and end = clock() at the end; the difference is in clock ticks, which I think come out to something close to milliseconds, and dividing by CLOCKS_PER_SEC converts it to seconds.
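
Something along these lines (a minimal, untested sketch; the dummy loop just stands in for the finite-difference code being measured):

Code:
#include <stdio.h>
#include <time.h>

/* Minimal timing sketch: clock() returns processor time in clock
   ticks; dividing the difference by CLOCKS_PER_SEC gives seconds. */
int main(void)
{
    clock_t start, end;
    volatile double s = 0.0;
    long k;

    start = clock();
    for (k = 0; k < 50000000L; k++)   /* work being timed */
        s += (double)k;
    end = clock();

    printf("elapsed: %ld ticks  (%.3f s)\n",
           (long)(end - start),
           (double)(end - start) / CLOCKS_PER_SEC);
    return 0;
}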

So for my baseline, I'll start with the code I posted above, run the loops for both the u data and v data on 100x100 arrays (with some other code also in the loop) and I'll do this 1000 times.

For the starting code above, I get 2600 clock ticks. However, just by computing dxsq = dx*dx outside the loop and using dxsq in the routine, the execution time drops to about 1970.

I'll see just how much I can cut the total execution time by trying the various suggestions above.
 
  • #9
Borek
jackmell said:
I don't think you need that other if in there huh?

Sorry, stupid mistake. What I was aiming at is that when you have something like

Code:
if (a == something1) { do_something_1; }
if (a == something2) { do_something_2; }
if (a == something3) { do_something_3; }
if (a == something4) { do_something_4; }

it is obvious "a" can't be something1 and something2 at the same time. So after you have checked the first condition you don't have to check all others:

Code:
if (a == something1) { do_something_1; }
  else if (a == something2) { do_something_2; }
    else if (a == something3) { do_something_3; }
      else if (a == something4) { do_something_4; }

That's equivalent to the first point in DH's comments.
 
  • #10
jackmell
I ran the timing with DH's code:

Code:
double dxsq = dx * dx;
for (i = 0; i < M; i++) {
   int ip = i > 0   ? i-1 : M-1;
   int in = i < M-1 ? i+1 : 0;
   for (j = 0; j < N; j++) {
      int jp = j > 0   ? j-1 : N-1;
      int jn = j < N-1 ? j+1 : 0;
      lap[i][j] = ((u[i][jp] - u[i][j]) +
                   (u[i][jn] - u[i][j]) +
                   (u[ip][j] - u[i][j]) +
                   (u[in][j] - u[i][j])) / dxsq;
   }
}

and obtained an average time of 1280, which is a nice improvement over the 1970, but I think that code is still testing a lot of conditionals that may not be necessary. I'll try to implement the other suggestions to see if I can reduce the execution time further.

I like rc's suggestion:

rcgldr said:
You could move the end point cases (i = 0, i = M-1, j = 0, j = N-1) outside the loops, and then eliminate all the conditionals in loops that run from i = 1 to i < M-1 and j = 1 to j < N-1.

I think that means: suppose we have just a 1D array u of size 10, u(0...9); then, to eliminate all conditionals, I could use for the 1D Laplacian (in pseudocode):

lap(0)=u(1)-2u(0)+u(9)

for(i=1; i<=8; i++)
lap(i)=u(i+1)-2u(i)+u(i-1);

lap(9)=u(0)-2u(9)+u(8)

There, that's eliminated all the conditionals, and I suppose I can then implement the same idea in 2D in the program itself.
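
In actual C, that 1D version might look like this (just a sketch, with the size 10 hard-coded only to mirror the example above):

Code:
/* Sketch of the 1D periodic Laplacian with the end points peeled out
   of the loop, so the interior needs no conditionals. */
void laplacian_1d(const double u[10], double lap[10])
{
    int i;

    lap[0] = u[1] - 2*u[0] + u[9];        /* left end wraps to u[9]  */

    for (i = 1; i <= 8; i++)              /* interior, no wrapping   */
        lap[i] = u[i+1] - 2*u[i] + u[i-1];

    lap[9] = u[0] - 2*u[9] + u[8];        /* right end wraps to u[0] */
}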
 
  • #11
D H
AlephZero said:
In the DH-style statement that calculates lap[i][j], get rid of as many parentheses as you can. In standard C/C++, they force the compiler to evaluate the expression in the particular order defined by the parentheses, and (at least at first sight) there seems to be no reason why that is essential.

There are many cases where this is essential. It is important to remember that floating point representation is not exact. When you subtract two numbers that are "close" to one another you lose quite a bit of precision. When you add two numbers that are close to one another you lose a little bit of precision. Do this with four numbers and you lose even more. Now subtract 4*u[i][j] and the result might be considerably less precise than the one obtained using the forced grouping I used.
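
A tiny illustration of how much the grouping can matter (made-up numbers, purely for illustration):

Code:
#include <stdio.h>

/* The same mathematical sum evaluated in two different orders. */
int main(void)
{
    double big = 1.0e16, small = 1.0;

    double grouped   = (big - big) + small;   /* cancel first: 1.0         */
    double ungrouped = big + (small - big);   /* small term absorbed: 0.0  */

    printf("grouped   = %g\n", grouped);
    printf("ungrouped = %g\n", ungrouped);
    return 0;
}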

That said, I did suggest that alternate scheme, lap[i][j] = (u[i][jp]+u[i][jn]+u[ip][j]+u[in][j]-4*u[i][j]) / dxsq, at the end of post #4.


AlephZero said:
A different approach to all of this is [to] make the arrays two elements longer in each direction [and] copy the "wrapped round" data into the extra elements.

That is a very good approach, particularly when you tell the compiler to optimize the code. Optimizers typically don't know what to do with those "wrapped round" indices, whether computed by an if statement, using modulus, or an array lookup. The optimizer can go to town with simple indices such as [i][j], [i+1][j], [i-1][j], [i][j+1], and [i][j-1].
 
  • #12
jackmell
Hey guys. Those are all good suggestions. The code below is what I believe is the implementation eliminating all conditionals. It's a lot longer, but it reduces the execution time to an average of 1020 (a little more than one second), which is still a nice improvement over the 1280. However, I'm not sure at this point if I have the logic correct. I do the four corners separately, the interior points of the perimeter separately, then the internal block (minus the perimeter). That's a total of 9 separate constructs.

Can anyone suggest a way that might reduce it further?

Code:
/* bottom left corner */
	lap[0][0] = (u[0][1] - 4*u[0][0] + u[0][M-1] + u[1][0] + u[N-1][0]) / dxsq;

/* interior of bottom row */
	for (j = 1; j <= M-2; j++)
	{
		lap[0][j] = (u[0][j+1] - 4*u[0][j] + u[0][j-1] + u[1][j] + u[N-1][j]) / dxsq;
	}

/* bottom right corner */
	lap[0][M-1] = (u[0][0] - 4*u[0][M-1] + u[0][M-2] + u[1][M-1] + u[N-1][M-1]) / dxsq;

/* interior of left vertical strip */
	for (i = 1; i <= N-2; i++)
	{
		lap[i][0] = (u[i][1] - 4*u[i][0] + u[i][M-1] + u[i+1][0] + u[i-1][0]) / dxsq;
	}

/* interior of right vertical strip */
	for (i = 1; i <= N-2; i++)
	{
		lap[i][M-1] = (u[i][0] - 4*u[i][M-1] + u[i][M-2] + u[i+1][M-1] + u[i-1][M-1]) / dxsq;
	}

/* top left corner */
	lap[N-1][0] = (u[N-1][1] - 4*u[N-1][0] + u[N-1][M-1] + u[0][0] + u[N-2][0]) / dxsq;

/* top right corner */
	lap[N-1][M-1] = (u[N-1][0] - 4*u[N-1][M-1] + u[N-1][M-2] + u[0][M-1] + u[N-2][M-1]) / dxsq;

/* interior of top row */
	for (j = 1; j <= M-2; j++)
	{
		lap[N-1][j] = (u[N-1][j+1] - 4*u[N-1][j] + u[N-1][j-1] + u[0][j] + u[N-2][j]) / dxsq;
	}

/* internal block (starts at 1 so that row 1 and column 1 are not skipped) */
	for (i = 1; i <= N-2; i++)
		for (j = 1; j <= M-2; j++)
		{
			lap[i][j] = (u[i][j+1] - 4*u[i][j] + u[i][j-1] + u[i+1][j] + u[i-1][j]) / dxsq;
		}
 

1. Can compiler optimization improve the execution time of my C code?

Yes, compiler optimization can significantly improve the execution time of your C code. By enabling certain optimizations, the compiler can rearrange and optimize your code to make it run faster. However, the effectiveness of compiler optimization may vary depending on the code and the compiler being used.

2. How can I measure the execution time of my C code?

You can measure the execution time of your C code by using the clock() function from the time.h library. This function returns the processor time used since the program started, measured in clock ticks. Taking the difference between two readings and dividing it by CLOCKS_PER_SEC gives the elapsed time in seconds (multiply by 1000 for milliseconds).

3. Can reducing the number of statements in my code improve its execution time?

Yes, reducing the number of statements in your code can potentially improve its execution time. Fewer statements mean less work for the processor, which can result in faster execution. However, reducing the number of statements should not come at the cost of readability and maintainability of your code.

4. Will parallel processing speed up my C code?

Yes, implementing parallel processing techniques such as multithreading or multiprocessing can significantly speed up the execution time of your C code. This is because parallel processing allows different parts of the code to be executed simultaneously, utilizing multiple processing units and reducing the overall execution time.
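
For example, the interior Laplacian loop discussed in this thread could be parallelized with OpenMP. This is a sketch only; it assumes M, N, u, lap, and dxsq are declared as in the earlier posts and that the code is compiled with an OpenMP flag such as -fopenmp (GCC/Clang):

Code:
/* Sketch: distribute the rows of the Laplacian across threads with
   OpenMP.  Each row is independent, so no synchronization is needed
   inside the loop; j is declared outside, hence private(j). */
#pragma omp parallel for private(j)
for (i = 0; i < M; i++) {
   int ip = i > 0   ? i-1 : M-1;
   int in = i < M-1 ? i+1 : 0;
   for (j = 0; j < N; j++) {
      int jp = j > 0   ? j-1 : N-1;
      int jn = j < N-1 ? j+1 : 0;
      lap[i][j] = (u[i][jp] + u[i][jn] + u[ip][j] + u[in][j]
                   - 4*u[i][j]) / dxsq;
   }
}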

5. How can I optimize my C code for better execution time?

There are several ways to optimize your C code for better execution time. Some of the commonly used techniques include using efficient data structures, minimizing the use of memory, avoiding unnecessary function calls, and optimizing loops. It is also essential to regularly profile your code and identify any bottlenecks that can be optimized.
