Optimization of C Code Loop Unrolling

In summary: Yes, RIDX is defined as you've given. I already have a working version of naive_rotate that passes all tests, but the goal is to make it faster. I'm not sure if I'm supposed to mess with the way the data is stored. I think the problem is with how I'm unrolling the loop. If I unroll it the way you wrote, it passes the tests. But I think this way should work too: for (j = 0; j < dim; j++){ for (i = 0; i < dim; i+=4){ dst[RIDX(dim-1-j, i, dim)] = src[RIDX(i, j, dim)]; dst
  • #1
Fronzbot

Homework Statement


I need to optimize this given code that rotates an image 90 degrees so it runs at least three times faster:
Code:
void naive_rotate(int dim, pixel *src, pixel *dst)
{
    int i, j;

    for (i = 0; i < dim; i++)
        for (j = 0; j < dim; j++)
            dst[RIDX(dim-1-j, i, dim)] = src[RIDX(i, j, dim)];
}


Homework Equations


N/A


The Attempt at a Solution


The function I write is tested with the variable dim equal to multiples of 32. I tried unrolling the loop but I keep getting an error that the expected values have changed when it reaches dim = 96. My code is this:
Code:
void rotate(int dim, pixel *src, pixel *dst)
{
    int i;
    int j;

    for (j = 0; j < dim; j++) {
        for (i = 0; i < dim; i += 4) {
            dst[RIDX(dim-1-j, i, dim)] = src[RIDX(i, j, dim)];
            dst[RIDX(dim-1-(j+1), i+1, dim)] = src[RIDX(i+1, j+1, dim)];
            dst[RIDX(dim-1-(j+2), i+2, dim)] = src[RIDX(i+2, j+2, dim)];
            dst[RIDX(dim-1-(j+3), i+3, dim)] = src[RIDX(i+3, j+3, dim)];
        }
    }
    for (; j < dim; j++)
        for (; i < dim; i++)
            dst[RIDX(dim-1-j, i, dim)] = src[RIDX(i, j, dim)];
}

I don't know why it's not working. I originally did not have that last cleanup loop and still got the same error at dim = 96. What am I doing wrong with the unrolling? Any help will be much appreciated!
 
  • #2
You're only unrolling the inner loop (the one with i as the loop variable), so you shouldn't be changing the j indexes. In your optimized code the outer loop is going to run dim times (same as it did in the unoptimized code), but your inner loop is going to run dim/4 times, with each inner loop iteration doing four times as much work.

I haven't tried this out, but I think this is what you want.
Code:
for (j = 0; j < dim; j++)
{
   for (i = 0; i < dim; i+=4)
   {
      dst[RIDX(dim-1-j, i, dim)] = src[RIDX(i, j, dim)];
      dst[RIDX(dim-1-j, i+1, dim)] = src[RIDX(i+1, j, dim)];
      dst[RIDX(dim-1-j, i+2, dim)] = src[RIDX(i+2, j, dim)];
      dst[RIDX(dim-1-j, i+3, dim)] = src[RIDX(i+3, j, dim)];
   }
}
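
One way to check an unrolled version before submitting it is to compare its output against naive_rotate on the same input. The sketch below does that. Note the assumptions: RIDX is taken to be the row-major macro guessed at later in the thread, the pixel struct is a stand-in (the real one is not shown), and same_result is a hypothetical helper name.

```c
#include <stdlib.h>

/* Assumed definitions: the thread never shows the real ones. */
#define RIDX(i, j, dim) ((i) * (dim) + (j))
typedef struct { unsigned short red, green, blue; } pixel;

/* Reference version from the assignment. */
void naive_rotate(int dim, pixel *src, pixel *dst)
{
    int i, j;

    for (i = 0; i < dim; i++)
        for (j = 0; j < dim; j++)
            dst[RIDX(dim-1-j, i, dim)] = src[RIDX(i, j, dim)];
}

/* Inner loop unrolled by 4; only i changes inside the body.
   Requires dim to be a multiple of 4. */
void rotate_unrolled(int dim, pixel *src, pixel *dst)
{
    int i, j;

    for (j = 0; j < dim; j++) {
        for (i = 0; i < dim; i += 4) {
            dst[RIDX(dim-1-j, i, dim)]   = src[RIDX(i, j, dim)];
            dst[RIDX(dim-1-j, i+1, dim)] = src[RIDX(i+1, j, dim)];
            dst[RIDX(dim-1-j, i+2, dim)] = src[RIDX(i+2, j, dim)];
            dst[RIDX(dim-1-j, i+3, dim)] = src[RIDX(i+3, j, dim)];
        }
    }
}

/* Hypothetical helper: 1 if both versions agree, 0 if they differ,
   -1 if memory could not be allocated. */
int same_result(int dim)
{
    size_t n = (size_t)dim * (size_t)dim, k;
    pixel *src = malloc(n * sizeof *src);
    pixel *want = malloc(n * sizeof *want);
    pixel *got = malloc(n * sizeof *got);
    int ok = 1;

    if (!src || !want || !got) {
        free(src); free(want); free(got);
        return -1;
    }

    /* Give every source pixel a recognizable value. */
    for (k = 0; k < n; k++) {
        src[k].red = (unsigned short)k;
        src[k].green = (unsigned short)(2 * k + 1);
        src[k].blue = (unsigned short)(3 * k + 1);
    }

    naive_rotate(dim, src, want);
    rotate_unrolled(dim, src, got);

    for (k = 0; k < n; k++)
        if (want[k].red != got[k].red || want[k].green != got[k].green ||
            want[k].blue != got[k].blue)
            ok = 0;

    free(src); free(want); free(got);
    return ok;
}
```

Calling same_result with the sizes the lab uses (32, 64, 96, ...) should return 1 for each of them.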
 
  • #3
Fronzbot said:
I need to optimize this given code that rotates an image 90 degrees so it runs at least three times faster:
First things first: How do you know that you need to optimize the given code? Have you profiled your program to find the biggest hog? Ofttimes the CPU utilization is limited to a very small (very, very small) part of a program. Optimize that, to heck with the rest. Is naive_rotate truly the culprit? For now, I'll assume you have done this due diligence and found the source of the CPU consumption.

In general, it is a bit optimistic to think that loop unrolling will achieve a factor of three speedup. You are asking for a very significant improvement, and loop unrolling most likely will not deliver it: 10% typically, 25% if you are lucky. A factor of three means a 67% reduction in run time. To achieve such a huge reduction you should be looking at your algorithm and at the way you represent data.

What does this macro RIDX do? Something like #define RIDX(i,j,dim) (i*dim+j)? This might be a source of your inefficiency.

Your attempt at optimization doesn't work because you haven't unrolled the loop correctly. Your second set of loops (the for (; j < dim; j++) loop) will never execute because j is already at dim.
 
  • #4
D H said:
First things first: How do you know that you need to optimize the given code? Have you profiled your program to find the biggest hog? Ofttimes the CPU utilization is limited to a very small (very, very small) part of a program. Optimize that, to heck with the rest. Is naive_rotate truly the culprit? For now, I'll assume you have done this due diligence and found the source of the CPU consumption.
The way I understand the lab it seems like we should just be re-writing naive_rotate, but I didn't think of looking at the rest of the program. What's the best way to look at what's going the slowest? Compiling it and looking at the assembly seems like it would be the best option, yes?

In general, it is a bit optimistic to think that loop unrolling will achieve a factor of three optimization. You are asking for a very significant speedup. Loop unrolling most likely will not do that. 10% typically, 25% if you are lucky. You are asking for a 67% reduction. To achieve such huge reduction you should be looking at your algorithm and at the way you represent data.
Yeah, I figured it wouldn't get me to 3x faster, but I want to figure out the correct implementation of the unroll, it's driving me nuts!

What does this macro RIDX do? Something like #define RIDX(i,j,dim) (i*dim+j)? This might be a source of your inefficiency.
I don't even know why I haven't looked at that- would probably help me a bet, eh?

Your attempt at optimization doesn't work because you haven't unrolled the loop correctly. Your second set of loops (the for (; j < dim; j++) loop) will never execute because j is already at dim.
Ok, so how would I unroll it correctly? Is my first unroll correct? If it is, how do I properly set up the clean-up code? The resources I have on hand are useful, but they only cover a single for loop, not two nested ones, so I'm a bit perplexed.

Thanks for the response, I appreciate it!
 
  • #5
Fronzbot said:
The way I understand the lab it seems like we should just be re-writing naive_rotate, but I didn't think of looking at the rest of the program.
Ahhh - so this is a homework assignment then. What specifically are you supposed to do in this assignment?

What's the best way to look at what's going the slowest? Compiling it and looking at the assembly seems like it would be the best option, yes?
If you were specifically asked to rewrite naive_rotate to make it faster, then that is what you need to do. In general, the best thing to do is to compile and run your program with a profiler. The profiler will tell you where the hogs are.
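
A profiler (gprof is one common example) reports how much time each function takes. If none is at hand, a crude stand-in is to time the function yourself with the standard clock() call. The sketch below is only that: time_rotate and rotate_fn are hypothetical names, and RIDX and pixel are assumed definitions because the real ones are not shown in this thread.

```c
#include <stdlib.h>
#include <time.h>

/* Assumed definitions: the thread never shows the real ones. */
#define RIDX(i, j, dim) ((i) * (dim) + (j))
typedef struct { unsigned short red, green, blue; } pixel;

/* Any function with the same signature as naive_rotate. */
typedef void (*rotate_fn)(int, pixel *, pixel *);

void naive_rotate(int dim, pixel *src, pixel *dst)
{
    int i, j;

    for (i = 0; i < dim; i++)
        for (j = 0; j < dim; j++)
            dst[RIDX(dim-1-j, i, dim)] = src[RIDX(i, j, dim)];
}

/* Returns the CPU seconds spent in reps calls of fn,
   or -1.0 if the buffers could not be allocated. */
double time_rotate(rotate_fn fn, int dim, int reps)
{
    size_t n = (size_t)dim * (size_t)dim;
    pixel *src = calloc(n, sizeof *src);
    pixel *dst = calloc(n, sizeof *dst);
    volatile unsigned short sink;
    clock_t start, stop;
    int r;

    if (!src || !dst) {
        free(src); free(dst);
        return -1.0;
    }

    start = clock();
    for (r = 0; r < reps; r++)
        fn(dim, src, dst);
    stop = clock();

    /* Read one result so the work is not trivially discardable. */
    sink = dst[0].red;
    (void)sink;

    free(src); free(dst);
    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

Timing naive_rotate and a candidate replacement with the same dim and reps gives a rough ratio. Use enough repetitions that the total is well above the clock resolution.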

I don't even know why I haven't looked at that- would probably help me a bet, eh?
You bet. I assume you know how to do pointer arithmetic. Use that knowledge.


Ok, so how would I unroll it correctly? Is my first unroll correct?
I only see one attempt, and no, that is not correct for a number of reasons. You have an unrolling factor of 4. Walk through your code with dim equal to 3, 4, 5, and 6 (four separate calls to the function). You overrun the array with dim=3, and you don't set row/column 4 (starting from 0) with dim=5, rows/columns 4 and 5 with dim=6. You will need to do something special to prevent an overrun and to handle the parts you did not set in the unrolled loop.
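
As a sketch of what "something special" could look like with two nested loops: stop the unrolled inner loop while a full group of four still fits, then let a cleanup loop inside the same outer iteration finish the leftovers. RIDX and pixel are assumed definitions here, and rotate_unrolled_any and rotation_is_correct are hypothetical names used only for illustration.

```c
#include <stdlib.h>

/* Assumed definitions: the thread never shows the real ones. */
#define RIDX(i, j, dim) ((i) * (dim) + (j))
typedef struct { unsigned short red, green, blue; } pixel;

/* Unroll by 4 for any dim >= 0. */
void rotate_unrolled_any(int dim, pixel *src, pixel *dst)
{
    int i, j;

    for (j = 0; j < dim; j++) {
        /* Main loop: runs only while i, i+1, i+2, i+3 are all in range. */
        for (i = 0; i + 3 < dim; i += 4) {
            dst[RIDX(dim-1-j, i, dim)]   = src[RIDX(i, j, dim)];
            dst[RIDX(dim-1-j, i+1, dim)] = src[RIDX(i+1, j, dim)];
            dst[RIDX(dim-1-j, i+2, dim)] = src[RIDX(i+2, j, dim)];
            dst[RIDX(dim-1-j, i+3, dim)] = src[RIDX(i+3, j, dim)];
        }
        /* Cleanup: i carries over, so this handles the last 0 to 3 rows. */
        for (; i < dim; i++)
            dst[RIDX(dim-1-j, i, dim)] = src[RIDX(i, j, dim)];
    }
}

/* Hypothetical checker: 1 if every destination cell holds the expected
   source pixel, 0 if not, -1 if memory could not be allocated. */
int rotation_is_correct(int dim)
{
    size_t n = (size_t)dim * (size_t)dim, k;
    pixel *src = malloc(n * sizeof *src);
    pixel *dst = calloc(n, sizeof *dst);
    int i, j, ok = 1;

    if (!src || !dst) {
        free(src); free(dst);
        return -1;
    }

    /* Nonzero values, so a cell that was never written stands out. */
    for (k = 0; k < n; k++) {
        src[k].red = (unsigned short)(k + 1);
        src[k].green = (unsigned short)(2 * k + 1);
        src[k].blue = (unsigned short)(3 * k + 1);
    }

    rotate_unrolled_any(dim, src, dst);

    for (i = 0; i < dim; i++)
        for (j = 0; j < dim; j++) {
            pixel d = dst[RIDX(dim-1-j, i, dim)];
            pixel s = src[RIDX(i, j, dim)];
            if (d.red != s.red || d.green != s.green || d.blue != s.blue)
                ok = 0;
        }

    free(src); free(dst);
    return ok;
}
```

The cleanup loop sits inside the outer loop on purpose: the leftover rows have to be handled for every j, not once at the end.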
 
  • #6
D H said:
Ahhh - so this is a homework assignment then. What specifically are you supposed to do in this assignment?
Optimize naive_rotate and naive_smooth, although I only posted rotate because I just want to get through this one first.

If you were specifically asked to rewrite naive_rotate to make it faster, then that is what you need to do. In general, the best thing to do is to compile and run your program with a profiler. The profiler will tell you where the hogs are.
No idea what a profiler is at all...


I only see one attempt, and no, that is not correct for a number of reasons. You have an unrolling factor of 4. Walk through your code with dim equal to 3, 4, 5, and 6 (four separate calls to the function). You overrun the array with dim=3, and you don't set row/column 4 (starting from 0) with dim=5, rows/columns 4 and 5 with dim=6. You will need to do something special to prevent an overrun and to handle the parts you did not set in the unrolled loop.
Yeah, I meant my first loop, my bad. dim, as I said, is only ever a multiple of 32, so it is never less than 32 and I am not overrunning the array. That's where I'm confused: why is my unroll not working correctly? I can't see any problems with it, though I am new (like within the past few days) to loop unrolling, so I'm probably not seeing something important.
 
  • #7
Pretend the dimension is 4 (it doesn't really matter that 4 is not a multiple of 32). Walk through your code and the original by hand. Are they doing the same thing?
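
The same walk-through can be done mechanically. The sketch below touches no pixel data at all; it only records which source cells (row, column) each index pattern asks for when dim is 4 and counts the requests that fall outside the 4 x 4 image. The names trace, visit, trace_naive and trace_attempt are made up for this illustration.

```c
#include <string.h>

enum { DIM = 4 };

/* What a traced index pattern did. */
struct trace {
    int covered;        /* distinct in-range source cells requested */
    int out_of_bounds;  /* requests outside the DIM x DIM image */
};

/* Records one request for source cell (row, col). */
static void visit(int row, int col, int seen[DIM][DIM], struct trace *t)
{
    if (row < 0 || row >= DIM || col < 0 || col >= DIM) {
        t->out_of_bounds++;
    } else if (!seen[row][col]) {
        seen[row][col] = 1;
        t->covered++;
    }
}

/* Index pattern of the original double loop. */
struct trace trace_naive(void)
{
    int seen[DIM][DIM];
    struct trace t = {0, 0};
    int i, j;

    memset(seen, 0, sizeof seen);
    for (i = 0; i < DIM; i++)
        for (j = 0; j < DIM; j++)
            visit(i, j, seen, &t);
    return t;
}

/* Index pattern of the first unrolling attempt, where the k-th
   statement in the body uses both i+k and j+k. */
struct trace trace_attempt(void)
{
    int seen[DIM][DIM];
    struct trace t = {0, 0};
    int i, j, k;

    memset(seen, 0, sizeof seen);
    for (j = 0; j < DIM; j++)
        for (i = 0; i < DIM; i += 4)
            for (k = 0; k < 4; k++)
                visit(i + k, j + k, seen, &t);
    return t;
}
```

For dim = 4 the original pattern reaches all 16 cells with nothing out of range. The attempt reaches only 10 distinct cells, and 6 of its 16 requests fall outside the image because j+k runs past the last column.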

I have to run some errands; I won't be back until quite a bit later this evening. I have asked other homework helpers to dive in, but that may or may not happen.
 
  • #8
Will do. And this isn't due for two weeks so time is not of the essence here- I just want to work on it when I have some free time. Thanks for everything so far!
 
  • #9
Ok, well I got the unrolling to work (I should not have been touching j since I wasn't incrementing it by 4). The best result came from incrementing i by 16, and even then my performance was exactly the same as the naive_rotate code. I'm thinking I need a few separate loops that take different sets of i and j (like storing rows 0-3 and columns 0-3 at the same time as storing rows 0-3 and columns 8-11), but I 1) don't know how to do that and 2) need to figure out the block sizes in the cache to utilize it properly.

Any suggestions on how to approach this, or do you not understand what I'm talking about (I barely do!)?
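
The idea of working on small square pieces of the image at a time is usually called blocking or tiling. A minimal sketch follows, with the caveat that it is not the lab's reference solution. BLOCK = 16 is an arbitrary starting guess to be tuned by measurement, RIDX and pixel are assumed definitions, and rotate_blocked and blocked_is_correct are hypothetical names.

```c
#include <stdlib.h>

/* Assumed definitions: the thread never shows the real ones. */
#define RIDX(i, j, dim) ((i) * (dim) + (j))
typedef struct { unsigned short red, green, blue; } pixel;

enum { BLOCK = 16 };  /* tile edge length; a guess, tune by measuring */

/* Rotate one BLOCK x BLOCK tile at a time so that the source rows and
   destination rows in use stay within a small region of memory. */
void rotate_blocked(int dim, pixel *src, pixel *dst)
{
    int ii, jj, i, j;

    for (ii = 0; ii < dim; ii += BLOCK)
        for (jj = 0; jj < dim; jj += BLOCK)
            /* The extra "< dim" tests keep partial edge tiles in range. */
            for (i = ii; i < ii + BLOCK && i < dim; i++)
                for (j = jj; j < jj + BLOCK && j < dim; j++)
                    dst[RIDX(dim-1-j, i, dim)] = src[RIDX(i, j, dim)];
}

/* Hypothetical checker: 1 if the rotation is right, 0 if not,
   -1 if memory could not be allocated. */
int blocked_is_correct(int dim)
{
    size_t n = (size_t)dim * (size_t)dim, k;
    pixel *src = malloc(n * sizeof *src);
    pixel *dst = calloc(n, sizeof *dst);
    int i, j, ok = 1;

    if (!src || !dst) {
        free(src); free(dst);
        return -1;
    }

    for (k = 0; k < n; k++) {
        src[k].red = (unsigned short)(k + 1);
        src[k].green = (unsigned short)(2 * k + 1);
        src[k].blue = (unsigned short)(3 * k + 1);
    }

    rotate_blocked(dim, src, dst);

    for (i = 0; i < dim; i++)
        for (j = 0; j < dim; j++) {
            pixel d = dst[RIDX(dim-1-j, i, dim)];
            pixel s = src[RIDX(i, j, dim)];
            if (d.red != s.red || d.green != s.green || d.blue != s.blue)
                ok = 0;
        }

    free(src); free(dst);
    return ok;
}
```

Whether this is faster, and for which BLOCK, depends on the cache of the machine it runs on, so it has to be timed rather than assumed.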
 
  • #10
I think you should follow my suggestions to (a) look at what RIDX is doing and (b) use pointer arithmetic.
 
  • #11
Hey, sorry I haven't responded; I've been busy with other coursework.

Anyways, I checked and RIDX is not in my file at all. I assume it's defined elsewhere, but I don't know in which file (there are a ton, and my prof says we don't need to go into any of them). As for pointer arithmetic, I don't really know what that is...
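
For reference, pointer arithmetic here means stepping a pointer through the array instead of recomputing an index each time. If RIDX is the row-major macro guessed at earlier in the thread (an assumption, since the real definition is in a file not shown), every use of it costs a multiplication. Moving down one source row is the same as adding dim to a pointer, and moving along a destination row is the same as adding 1. A minimal sketch, with rotate_ptr and ptr_is_correct as hypothetical names and pixel as a stand-in type:

```c
#include <stdlib.h>

/* Assumed definitions: the thread never shows the real ones. */
#define RIDX(i, j, dim) ((i) * (dim) + (j))
typedef struct { unsigned short red, green, blue; } pixel;

/* Same mapping as naive_rotate, but the inner loop only adds to pointers. */
void rotate_ptr(int dim, pixel *src, pixel *dst)
{
    int i, j;

    if (dim <= 0)
        return;

    for (j = 0; j < dim; j++) {
        pixel *s = src + j;                          /* &src[RIDX(0, j, dim)] */
        pixel *d = dst + (size_t)(dim - 1 - j) * dim; /* &dst[RIDX(dim-1-j, 0, dim)] */

        for (i = 0; i < dim - 1; i++) {
            *d++ = *s;   /* next column of the destination row */
            s += dim;    /* next row of the source column */
        }
        *d = *s;         /* last row, copied without moving s past the array */
    }
}

/* Hypothetical checker: compares against the RIDX formula.
   1 if correct, 0 if not, -1 if memory could not be allocated. */
int ptr_is_correct(int dim)
{
    size_t n = (size_t)dim * (size_t)dim, k;
    pixel *src = malloc(n * sizeof *src);
    pixel *dst = calloc(n, sizeof *dst);
    int i, j, ok = 1;

    if (!src || !dst) {
        free(src); free(dst);
        return -1;
    }

    for (k = 0; k < n; k++) {
        src[k].red = (unsigned short)(k + 1);
        src[k].green = (unsigned short)(2 * k + 1);
        src[k].blue = (unsigned short)(3 * k + 1);
    }

    rotate_ptr(dim, src, dst);

    for (i = 0; i < dim; i++)
        for (j = 0; j < dim; j++) {
            pixel d = dst[RIDX(dim-1-j, i, dim)];
            pixel s = src[RIDX(i, j, dim)];
            if (d.red != s.red || d.green != s.green || d.blue != s.blue)
                ok = 0;
        }

    free(src); free(dst);
    return ok;
}
```

An optimizing compiler will often make this transformation on its own, so the gain from writing it by hand is not guaranteed and is worth measuring.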
 

1. What is loop unrolling in C code optimization?

Loop unrolling is a technique used in C code optimization to improve the performance of loops. It involves expanding the loop body so that each pass does the work of several original iterations, which reduces how often the loop counter is updated and the loop condition is tested. It can be done by hand or by the compiler.
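
As a small illustration of that definition, here is a one-dimensional sum written both ways. This is only a sketch; the function names sum_rolled and sum_unrolled4 are made up for the example.

```c
/* Plain loop: one addition, one counter update and one test per element. */
long sum_rolled(const int *a, int n)
{
    long total = 0;
    int i;

    for (i = 0; i < n; i++)
        total += a[i];
    return total;
}

/* Unrolled by 4: one counter update and one test per four elements. */
long sum_unrolled4(const int *a, int n)
{
    long total = 0;
    int i;

    /* Main loop: runs only while a full group of four is available. */
    for (i = 0; i + 3 < n; i += 4)
        total += (long)a[i] + a[i+1] + a[i+2] + a[i+3];

    /* Cleanup loop: the remaining 0 to 3 elements. */
    for (; i < n; i++)
        total += a[i];
    return total;
}
```

Both functions return the same value for any n >= 0; the cleanup loop is what makes the unrolled one safe when n is not a multiple of 4.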

2. What are the benefits of loop unrolling in C code optimization?

Loop unrolling can lead to improved performance by reducing the overhead of loop control: fewer counter updates, comparisons, and branches are executed for the same amount of work. It can also expose more independent operations for the compiler and the processor to schedule, which increases the opportunities for compiler optimizations.

3. How do you determine the optimal number of loop unrollings?

The optimal number of loop unrollings depends on factors such as the size of the loop, the complexity of the loop body, and the target hardware. It is often determined through experimentation and fine-tuning.

4. Are there any potential drawbacks of loop unrolling in C code optimization?

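One potential drawback of loop unrolling is increased code size, which can put pressure on the instruction cache and lead to longer compile times. It also requires cleanup code when the iteration count is not a multiple of the unrolling factor, and it can make the code more difficult to read and maintain.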

5. What are some other techniques for optimizing C code besides loop unrolling?

Some other techniques for optimizing C code include using data types and data structures efficiently, arranging memory accesses to make good use of the cache, avoiding unnecessary operations and function calls, and using compiler optimization flags (for example GCC's -O2 or -funroll-loops). Profiling to identify bottlenecks helps decide where such effort is worthwhile.
