Can execution time of this C code be reduced?

In summary, the conversation discusses optimizing code for a finite-difference problem in differential equations. The code sweeps a matrix and evaluates a specific stencil for each element, wrapping around at the edges of the matrix. The respondents suggest various ways to optimize the code, such as combining if statements, using the modulus operator, and caching the wrapped indices in an array. They also note that a timer can be used to measure execution time and that hoisting unnecessary floating-point multiplications out of the loop improves the code's efficiency.
  • #1
jackmell
Hi. I've been working on a finite-difference problem in differential equations. I was wondering if this code can be further optimized, perhaps by eliminating some if statements or assignment statements. That is, what is the most efficient way to code this, or is it already written in the most efficient way? It goes through the matrix u(0..M-1, 0..N-1) and for each u(i,j) computes:

u(i,j+1)-2u(i,j)+u(i,j-1)+u(i+1,j)-2u(i,j)+u(i-1,j)

except that when it gets to the end, it wraps around to the start and likewise when it's at the start, it wraps around to the end so that:

if i=0, then i-1 goes to M-1
if i=M-1, then i+1 goes to 0

if j=0, then j-1 goes to N-1
if j=N-1, then j+1 goes to 0

Here's the code taken from:

http://www.teemuleppanen.net/content/science/turing/turing_learning.shtml

Code:
for (i = 0; i < M; i++)
    for (j = 0; j < N; j++) {
        lap[i][j] = 0;
        if (j != N-1)
            lap[i][j] += u[i][j+1] - u[i][j];
        if (j != 0)
            lap[i][j] += u[i][j-1] - u[i][j];
        if (i != 0)
            lap[i][j] += u[i-1][j] - u[i][j];
        if (i != M-1)
            lap[i][j] += u[i+1][j] - u[i][j];
        if (j == N-1)
            lap[i][j] += u[i][0] - u[i][j];
        if (j == 0)
            lap[i][j] += u[i][N-1] - u[i][j];
        if (i == M-1)
            lap[i][j] += u[0][j] - u[i][j];
        if (i == 0)
            lap[i][j] += u[M-1][j] - u[i][j];
        lap[i][j] = lap[i][j] / (dx*dx);
    }
 
  • #2
Borek
Code that is not executed takes the least time. At first sight a tiny bit of optimization can be possible. Think if

Code:
if (i != 0) lap[i][j] += u[i-1][j]-u[i][j];
   else if (i == 0) lap[i][j] += u[M-1][j]-u[i][j];

is not faster.

What processor? A few generations back, on the Pentium, it was better to write conditions that were more often true.

But that's just at first sight and from looking at the code only, could be there is more to be gained in the algorithm.
 
  • #3
jackmell
Borek said:
Code that is not executed takes the least time. At first sight a tiny bit of optimization can be possible. Think if

Code:
if (i != 0) lap[i][j] += u[i-1][j]-u[i][j];
   else if (i == 0) lap[i][j] += u[M-1][j]-u[i][j];

is not faster.

What processor? A few generations back, on the Pentium, it was better to write conditions that were more often true.

But that's just at first sight and from looking at the code only, could be there is more to be gained in the algorithm.

I don't think you need that other if in there huh?

Is there any way to code this without any ifs at all? I mean, can we create a set of pointers which automatically "wrap" around and just cycle through the pointers without having to check for the end of the array? For example, create a list:

myxpt=M-1,0,1,2,3,...M-2,M-1,0

and then just index into u like:

u[myxpt[i]][myypt[j]]

Not sure though and maybe the extra indexing would take just as much time as the if statements.
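
Something like this is what I have in mind. Just an untested sketch, and the index-array names are made up; it assumes M, N, u, lap and dx are declared elsewhere (C99 for the variable-length arrays):

Code:
/* Untested sketch: precompute the wrapped neighbor indices once,
   so the inner loop needs no ifs at all. */
int iprev[M], inext[M], jprev[N], jnext[N];

for (i = 0; i < M; i++) {
    iprev[i] = (i == 0)   ? M-1 : i-1;   /* wrapped i-1 */
    inext[i] = (i == M-1) ? 0   : i+1;   /* wrapped i+1 */
}
for (j = 0; j < N; j++) {
    jprev[j] = (j == 0)   ? N-1 : j-1;
    jnext[j] = (j == N-1) ? 0   : j+1;
}

for (i = 0; i < M; i++)
    for (j = 0; j < N; j++)
        lap[i][j] = (u[i][jnext[j]] + u[i][jprev[j]]
                   + u[inext[i]][j] + u[iprev[i]][j]
                   - 4*u[i][j]) / (dx*dx);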

Also, this code is being run many thousands of times and the u and v arrays could be large so even small improvements could make a nice difference in execution time.

I'm not against assembly, but as I recall, well-optimized C usually runs about as fast as hand-written assembly.
 
  • #4
D H
jackmell said:
Code:
for (i = 0; i < M; i++)
    for (j = 0; j < N; j++) {
        lap[i][j] = 0;
        if (j != N-1)
            lap[i][j] += u[i][j+1] - u[i][j];
        if (j != 0)
            lap[i][j] += u[i][j-1] - u[i][j];
        if (i != 0)
            lap[i][j] += u[i-1][j] - u[i][j];
        if (i != M-1)
            lap[i][j] += u[i+1][j] - u[i][j];
        if (j == N-1)
            lap[i][j] += u[i][0] - u[i][j];
        if (j == 0)
            lap[i][j] += u[i][N-1] - u[i][j];
        if (i == M-1)
            lap[i][j] += u[0][j] - u[i][j];
        if (i == 0)
            lap[i][j] += u[M-1][j] - u[i][j];
        lap[i][j] = lap[i][j] / (dx*dx);
    }

Yech.

First, a purely stylistic comment: Always use braces with loops and conditional statements. There is no difference in performance between

Code:
if (condition)
   do_something();

versus

Code:
if (condition) {
   do_something();
}

But enough with the soapboxing. So what are the other reasons for that "yech"?
  • You are making every decision twice. Your code will execute faster if you, for example, combine the if (j != 0) with the if (j == 0) code. Use the else statement.
  • You are making a lot of unnecessary decisions with respect to the outer loop variable i. Make those decisions outside the inner loop, capturing the results in some local variable.
  • That convoluted code is hiding what you are really doing here, which is a two-dimensional finite difference with wrapping at the edges of the array.

Putting this together, try


Code:
double dxsq = dx * dx;
for (i = 0; i < M; i++) {
   int ip = i > 0   ? i-1 : M-1;
   int in = i < M-1 ? i+1 : 0;
   for (j = 0; j < N; j++) {
      int jp = j > 0   ? j-1 : N-1;
      int jn = j < N-1 ? j+1 : 0;
      lap[i][j] = ((u[i][jp] - u[i][j]) +
                   (u[i][jn] - u[i][j]) +
                   (u[ip][j] - u[i][j]) +
                   (u[in][j] - u[i][j])) / dxsq;
   }
}

Now with the adjacency indices separated from the difference equation you can play around with different mechanisms for computing the same end result:
  • Which is faster: Branching to compute the adjacent indices, using the modulus operator, or caching those indices in an array? (An untested modulus version is sketched just after this list.)
  • Which is faster: The difference calculation as done above or something else such as
    lap[i][j] = ((u[i][jp]+u[i][jn]+u[ip][j]+u[in][j])-4*u[i][j]) / dxsq;?
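
For reference, an untested sketch of the modulus variant, keeping the same difference grouping as above so that only the index computation differs:

Code:
/* Untested sketch of the modulus variant.  (i + M - 1) % M is used
   rather than (i - 1) % M because in C the remainder of a negative
   dividend is (typically) negative, e.g. (0 - 1) % M == -1, which
   would index out of bounds. */
for (i = 0; i < M; i++) {
   int ip = (i + M - 1) % M;   /* wrapped i-1 */
   int in = (i + 1) % M;       /* wrapped i+1 */
   for (j = 0; j < N; j++) {
      int jp = (j + N - 1) % N;
      int jn = (j + 1) % N;
      lap[i][j] = ((u[i][jp] - u[i][j]) +
                   (u[i][jn] - u[i][j]) +
                   (u[ip][j] - u[i][j]) +
                   (u[in][j] - u[i][j])) / dxsq;
   }
}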
 
  • #5
rcgldr
You could move the end point cases (i = 0, i = M-1, j = 0, j = N-1) outside the loops, and then eliminate all the conditionals in loops that run from i = 1 to i < M-1 and j = 1 to j < N-1.
 
  • #6
jackmell
Ok, thanks guys. I think caching the indexes in an array would be faster but not sure. I'll test your code.

Could you guys please tell me what I need to wrap around that code to compute the execution time? I assume it's a timer but I don't remember the syntax.

Thanks guys!

Also, I just noticed something that DH pointed out: dx*dx is a floating-point multiplication that's being executed for every element of the array when only one multiplication is really necessary. That's a big improvement already; in fact this code sits inside another loop that's executed, say, 50000 times, so I could move that computation outside the loops and do it only once.
 
  • #7
AlephZero
I'm not sure I would go quite as far as DH did, but you certainly want to take all the "boundary cases" and their if statements out of your main loop. Personally I would put them in two separate loops (one for the row boundaries, the other for the column boundaries).

In the DH-style statement that calculates lap[i][j], get rid of as many parentheses as you can. In standard C/C++, they force the compiler to evaluate the expression in the particular order defined by the parentheses, and (at least at first sight) there seems to be no reason why that is essential.

You can then save some typing by writing -4*u[i][j] instead of 4 separate subtractions of u[i][j], of course.

A different approach to all of this (sketched in the code below) is:

* Make your arrays two elements longer in each direction,
* After each iteration, copy the "wrapped round" data into the extra elements (in a separate loop).
* The main loop then needs no logic and is simple for the compiler to optimize. You might as well leave the subscripts as i+1, i-1, etc, to give the compiler an easy optimization task.
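
A rough, untested sketch of that padded ("ghost cell") layout, assuming u and lap are declared with two extra rows and columns (e.g. u[M+2][N+2]), the real data lives in u[1..M][1..N], and dxsq = dx*dx as in the earlier post:

Code:
/* Untested sketch.  Before each Laplacian pass, copy the wrapped
   values into the extra "ghost" rows and columns; the main loop then
   needs no wrapping logic at all. */
for (j = 1; j <= N; j++) {          /* ghost rows: top/bottom wrap    */
    u[0][j]   = u[M][j];
    u[M+1][j] = u[1][j];
}
for (i = 1; i <= M; i++) {          /* ghost columns: left/right wrap */
    u[i][0]   = u[i][N];
    u[i][N+1] = u[i][1];
}

for (i = 1; i <= M; i++)
    for (j = 1; j <= N; j++)
        lap[i][j] = (u[i][j+1] + u[i][j-1] + u[i+1][j] + u[i-1][j]
                     - 4*u[i][j]) / dxsq;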
 
  • #8
jackmell
Ok guys. Those are good suggestions. I figured out how to time the loop. I need to include <time.h> in my code. Looks like I can do start = clock() at the start of the loop and end = clock() at the end; the difference is in clock ticks, which I think come out to something close to milliseconds, and dividing by CLOCKS_PER_SEC converts it to seconds.
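
Something along these lines (a minimal, untested sketch; the dummy loop just stands in for the finite-difference code being measured):

Code:
#include <stdio.h>
#include <time.h>

/* Minimal timing sketch: clock() returns processor time in clock
   ticks; dividing the difference by CLOCKS_PER_SEC gives seconds. */
int main(void)
{
    clock_t start, end;
    volatile double s = 0.0;
    long k;

    start = clock();
    for (k = 0; k < 50000000L; k++)   /* work being timed */
        s += (double)k;
    end = clock();

    printf("elapsed: %ld ticks  (%.3f s)\n",
           (long)(end - start),
           (double)(end - start) / CLOCKS_PER_SEC);
    return 0;
}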

So for my baseline, I'll start with the code I posted above, run the loops for both the u data and v data on 100x100 arrays (with some other code also in the loop) and I'll do this 1000 times.

For the starting code above, I get 2600 clock ticks. However, just by computing dxsq = dx*dx outside the loop and using dxsq in the routine, the execution time drops to about 1970.

I'll see just how much I can cut the total execution time by trying the various suggestions above.
 
  • #9
Borek
jackmell said:
I don't think you need that other if in there huh?

Sorry, stupid mistake. What I was aiming at is that when you have something like

Code:
if (a == something1) { do_something_1; }
if (a == something2) { do_something_2; }
if (a == something3) { do_something_3; }
if (a == something4) { do_something_4; }

it is obvious "a" can't be something1 and something2 at the same time. So after you have checked the first condition you don't have to check all others:

Code:
if (a == something1) { do_something_1; }
  else if (a == something2) { do_something_2; }
    else if (a == something3) { do_something_3; }
      else if (a == something4) { do_something_4; }

That's equivalent to the first point in DH's comments.
 
  • #10
jackmell
I ran the timing with DH's code:

Code:
double dxsq = dx * dx;
for (i = 0; i < M; i++) {
   int ip = i > 0   ? i-1 : M-1;
   int in = i < M-1 ? i+1 : 0;
   for (j = 0; j < N; j++) {
      int jp = j > 0   ? j-1 : N-1;
      int jn = j < N-1 ? j+1 : 0;
      lap[i][j] = ((u[i][jp] - u[i][j]) +
                   (u[i][jn] - u[i][j]) +
                   (u[ip][j] - u[i][j]) +
                   (u[in][j] - u[i][j])) / dxsq;
   }
}

and obtained an average time of 1280, which is a nice improvement over the 1970, but I think that code is still testing a lot of conditionals that may not be necessary. I'll try to implement the other suggestions to see if I can reduce the execution time further.

I like rc's suggestion:

rcgldr said:
You could move the end point cases (i = 0, i = M-1, j = 0, j = N-1) outside the loops, and then eliminate all the conditionals in loops that run from i = 1 to i < M-1 and j = 1 to j < N-1.

I think that means: suppose we have just a 1D array u of size 10, u(0...9); then, to eliminate all conditionals, I could use for the 1D Laplacian (in pseudocode):

lap(0)=u(1)-2u(0)+u(9)

for(i=1; i<=8; i++)
lap(i)=u(i+1)-2u(i)+u(i-1);

lap(9)=u(0)-2u(9)+u(8)

There, that's eliminated all the conditionals, and I suppose I can then implement the same idea in 2D in the program itself.
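
In actual C, that 1D version might look like this (just a sketch, with the size 10 hard-coded only to mirror the example above):

Code:
/* Sketch of the 1D periodic Laplacian with the end points peeled out
   of the loop, so the interior needs no conditionals. */
void laplacian_1d(const double u[10], double lap[10])
{
    int i;

    lap[0] = u[1] - 2*u[0] + u[9];        /* left end wraps to u[9]  */

    for (i = 1; i <= 8; i++)              /* interior, no wrapping   */
        lap[i] = u[i+1] - 2*u[i] + u[i-1];

    lap[9] = u[0] - 2*u[9] + u[8];        /* right end wraps to u[0] */
}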
 
  • #11
D H
AlephZero said:
In the DH-style statement that calculates lap[i][j], get rid of as many parentheses as you can. In standard C/C++, they force the compiler to evaluate the expression in the particular order defined by the parentheses, and (at least at first sight) there seems to be no reason why that is essential.

There are many cases where this is essential. It is important to remember that floating point representation is not exact. When you subtract two numbers that are "close" to one another you lose quite a bit of precision. When you add two numbers that are close to one another you lose a little bit of precision. Do this with four numbers and you lose even more. Now subtract 4*u[i][j] and the result might be considerably less precise than the one obtained using the forced grouping I used.
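
A tiny illustration of how much the grouping can matter (made-up numbers, purely for illustration):

Code:
#include <stdio.h>

/* The same mathematical sum evaluated in two different orders. */
int main(void)
{
    double big = 1.0e16, small = 1.0;

    double grouped   = (big - big) + small;   /* cancel first: 1.0         */
    double ungrouped = big + (small - big);   /* small term absorbed: 0.0  */

    printf("grouped   = %g\n", grouped);
    printf("ungrouped = %g\n", ungrouped);
    return 0;
}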

That said, I did suggest that alternate scheme, lap[i][j] = (u[i][jp]+u[i][jn]+u[ip][j]+u[in][j]-4*u[i][j]) / dxsq, at the end of post #4.


AlephZero said:
A different approach to all of this is [to] make the arrays two elements longer in each direction [and] copy the "wrapped round" data into the extra elements.

That is a very good approach, particularly when you tell the compiler to optimize the code. Optimizers typically don't know what to do with those "wrapped round" indices, whether computed by an if statement, using modulus, or an array lookup. The optimizer can go to town with simple indices such as [i][j], [i+1][j], [i-1][j], [i][j+1], and [i][j-1].
 
  • #12
jackmell
Hey guys. Those are all good suggestions. The code below is what I believe is the implementation eliminating all conditionals. It's a lot longer, but it reduces the execution time to an average of 1020 (a little more than one second), which is still a nice improvement over the 1280. However, I'm not sure at this point if I have the logic correct. I do the four corners separately, the interior points of the perimeter separately, then the internal block (minus the perimeter). That's a total of 9 separate constructs.

Can anyone suggest a way that might reduce it further?

Code:
/* bottom left corner */
	lap[0][0] = (u[0][1] - 4*u[0][0] + u[0][M-1] + u[1][0] + u[N-1][0]) / dxsq;

/* interior of bottom row */
	for (j = 1; j <= M-2; j++)
	{
		lap[0][j] = (u[0][j+1] - 4*u[0][j] + u[0][j-1] + u[1][j] + u[N-1][j]) / dxsq;
	}

/* bottom right corner */
	lap[0][M-1] = (u[0][0] - 4*u[0][M-1] + u[0][M-2] + u[1][M-1] + u[N-1][M-1]) / dxsq;

/* interior of left vertical strip */
	for (i = 1; i <= N-2; i++)
	{
		lap[i][0] = (u[i][1] - 4*u[i][0] + u[i][M-1] + u[i+1][0] + u[i-1][0]) / dxsq;
	}

/* interior of right vertical strip */
	for (i = 1; i <= N-2; i++)
	{
		lap[i][M-1] = (u[i][0] - 4*u[i][M-1] + u[i][M-2] + u[i+1][M-1] + u[i-1][M-1]) / dxsq;
	}

/* top left corner */
	lap[N-1][0] = (u[N-1][1] - 4*u[N-1][0] + u[N-1][M-1] + u[0][0] + u[N-2][0]) / dxsq;

/* top right corner */
	lap[N-1][M-1] = (u[N-1][0] - 4*u[N-1][M-1] + u[N-1][M-2] + u[0][M-1] + u[N-2][M-1]) / dxsq;

/* interior of top row */
	for (j = 1; j <= M-2; j++)
	{
		lap[N-1][j] = (u[N-1][j+1] - 4*u[N-1][j] + u[N-1][j-1] + u[0][j] + u[N-2][j]) / dxsq;
	}

/* internal block (starts at 1 so that row 1 and column 1 are not skipped) */
	for (i = 1; i <= N-2; i++)
		for (j = 1; j <= M-2; j++)
		{
			lap[i][j] = (u[i][j+1] - 4*u[i][j] + u[i][j-1] + u[i+1][j] + u[i-1][j]) / dxsq;
		}
 

1. Can compiler optimization improve the execution time of my C code?

Yes, compiler optimization can significantly improve the execution time of your C code. By enabling certain optimizations, the compiler can rearrange and optimize your code to make it run faster. However, the effectiveness of compiler optimization may vary depending on the code and the compiler being used.

2. How can I measure the execution time of my C code?

You can measure the execution time of your C code by using the clock() function from the time.h library. This function returns the processor time used since the program started, measured in clock ticks. Taking the difference between two readings and dividing it by CLOCKS_PER_SEC gives the elapsed time in seconds (multiply by 1000 for milliseconds).

3. Can reducing the number of statements in my code improve its execution time?

Yes, reducing the number of statements in your code can potentially improve its execution time. Fewer statements mean less work for the processor, which can result in faster execution. However, reducing the number of statements should not come at the cost of readability and maintainability of your code.

4. Will parallel processing speed up my C code?

Yes, implementing parallel processing techniques such as multithreading or multiprocessing can significantly speed up the execution time of your C code. This is because parallel processing allows different parts of the code to be executed simultaneously, utilizing multiple processing units and reducing the overall execution time.
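
For example, the interior Laplacian loop discussed in this thread could be parallelized with OpenMP. This is a sketch only; it assumes M, N, u, lap, and dxsq are declared as in the earlier posts and that the code is compiled with an OpenMP flag such as -fopenmp (GCC/Clang):

Code:
/* Sketch: distribute the rows of the Laplacian across threads with
   OpenMP.  Each row is independent, so no synchronization is needed
   inside the loop; j is declared outside, hence private(j). */
#pragma omp parallel for private(j)
for (i = 0; i < M; i++) {
   int ip = i > 0   ? i-1 : M-1;
   int in = i < M-1 ? i+1 : 0;
   for (j = 0; j < N; j++) {
      int jp = j > 0   ? j-1 : N-1;
      int jn = j < N-1 ? j+1 : 0;
      lap[i][j] = (u[i][jp] + u[i][jn] + u[ip][j] + u[in][j]
                   - 4*u[i][j]) / dxsq;
   }
}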

5. How can I optimize my C code for better execution time?

There are several ways to optimize your C code for better execution time. Some of the commonly used techniques include using efficient data structures, minimizing the use of memory, avoiding unnecessary function calls, and optimizing loops. It is also essential to regularly profile your code and identify any bottlenecks that can be optimized.
