# Optimization of C Code Loop Unrolling

## Homework Statement

I need to optimize this given code that rotates an image 90 degrees so it runs at least three times faster:
Code:
void naive_rotate(int dim, pixel *src, pixel *dst)
{
int i, j;

for (i = 0; i < dim; i++)
for (j = 0; j < dim; j++)
dst[RIDX(dim-1-j, i, dim)] = src[RIDX(i, j, dim)];
}

N/A

## The Attempt at a Solution

The function I write is tested with the variable dim equal to multiples of 32. I tried unrolling the loop but I keep getting an error that the expected values have changed when it reaches dim = 96. My code is this:
Code:
void rotate(int dim, pixel *src, pixel *dst)
{
int i;
int j;

for (j = 0; j < dim; j++){
for (i = 0; i < dim; i+=4){
dst[RIDX(dim-1-j, i, dim)] = src[RIDX(i, j, dim)];
dst[RIDX(dim-1-(j+1), i+1, dim)] = src[RIDX(i+1, j+1, dim)];
dst[RIDX(dim-1-(j+2), i+2, dim)] = src[RIDX(i+2, j+2, dim)];
dst[RIDX(dim-1-(j+3), i+3, dim)] = src[RIDX(i+3, j+3, dim)];
}
}
for (; j < dim; j++)
for (; i < dim; i++)
dst[RIDX(dim-1-j, i, dim)] = src[RIDX(i, j, dim)];
}
I don't know why it's not working. I originally did not have that last cleanup loop and still got the same error at dim = 96. What am I doing wrong with the unrolling? Any help will be much appreciated!

Related Engineering and Comp Sci Homework Help News on Phys.org
Mark44
Mentor

## Homework Statement

I need to optimize this given code that rotates an image 90 degrees so it runs at least three times faster:
Code:
void naive_rotate(int dim, pixel *src, pixel *dst)
{
int i, j;

for (i = 0; i < dim; i++)
for (j = 0; j < dim; j++)
dst[RIDX(dim-1-j, i, dim)] = src[RIDX(i, j, dim)];
}

N/A

## The Attempt at a Solution

The function I write is tested with the variable dim equal to multiples of 32. I tried unrolling the loop but I keep getting an error that the expected values have changed when it reaches dim = 96. My code is this:
Code:
void rotate(int dim, pixel *src, pixel *dst)
{
int i;
int j;

for (j = 0; j < dim; j++){
for (i = 0; i < dim; i+=4){
dst[RIDX(dim-1-j, i, dim)] = src[RIDX(i, j, dim)];
dst[RIDX(dim-1-(j+1), i+1, dim)] = src[RIDX(i+1, j+1, dim)];
dst[RIDX(dim-1-(j+2), i+2, dim)] = src[RIDX(i+2, j+2, dim)];
dst[RIDX(dim-1-(j+3), i+3, dim)] = src[RIDX(i+3, j+3, dim)];
}
}
for (; j < dim; j++)
for (; i < dim; i++)
dst[RIDX(dim-1-j, i, dim)] = src[RIDX(i, j, dim)];
}
I don't know why it's not working. I originally did not have that last cleanup loop and still got the same error at dim = 96. What am I doing wrong with the unrolling? Any help will be much appreciated!
You're only unrolling the inner loop (with i as the loop variable), so you shouldn't be fiddling with the indexes with j. In your optimized code the outer loop is going to run dim times (same is it did in the unoptimized code), but your inner loop is going to run dim/4 times, with each inner loop iteration doing four times as much.

I haven't tried this out, but I think this is what you want.
Code:
for (j = 0; j < dim; j++)
{
for (i = 0; i < dim; i+=4)
{
dst[RIDX(dim-1-j, i, dim)] = src[RIDX(i, j, dim)];
dst[RIDX(dim-1-j, i+1, dim)] = src[RIDX(i+1, j, dim)];
dst[RIDX(dim-1-j, i+2, dim)] = src[RIDX(i+2, j, dim)];
dst[RIDX(dim-1-j, i+3, dim)] = src[RIDX(i+3, j, dim)];
}
}

D H
Staff Emeritus

## Homework Statement

I need to optimize this given code that rotates an image 90 degrees so it runs at least three times faster:
First things first: How do you know that you need to optimize the given code? Have you profiled your program to find the biggest hog? Ofttimes the CPU utilization is limited to a very small (very, very small) part of a program. Optimize that, to heck with the rest. Is naive_rotate truly the culprit? For now, I'll assume you have done this due diligence and found the source of the CPU consumption.

In general, it is a bit optimistic to think that loop unrolling will achieving a factor of three optimization. You are asking for a very significant speedup. Loop unrolling most likely will not do that. 10% typically, 25% if you are lucky. You are asking for a 67% reduction. To achieve such huge reduction you should be looking at your algorithm and at the way you represent data.

What does this macro RIDX do? Something like #define RIDX(i,j,dim) (i*dim+j)? This might be a source of your inefficiency.

Your attempt at optimization doesn't work because you haven't unrolled the loop correctly. Your second set of loops (the for (; j < dim; j++) loop) will never execute because j is already at dim.

Last edited:
First things first: How do you know that you need to optimize the given code? Have you profiled your program to find the biggest hog? Ofttimes the CPU utilization is limited to a very small (very, very small) part of a program. Optimize that, to heck with the rest. Is naive_rotate truly the culprit? For now, I'll assume you have done this due diligence and found the source of the CPU consumption.
The way I understand the lab it seems like we should just be re-writing naive_rotate, but I didn't think of looking at the rest of the program. What's the best way to look at what's going the slowest? Compiling it and looking at the assembly seems like it would be the best option, yes?

In general, it is a bit optimistic to think that loop unrolling will achieving a factor of three optimization. You are asking for a very significant speedup. Loop unrolling most likely will not do that. 10% typically, 25% if you are lucky. You are asking for a 67% reduction. To achieve such huge reduction you should be looking at your algorithm and at the way you represent data.
Yeah, I figured it wouldn't get me to 3x faster, but I want to figure out the correct implementation of the unroll, it's driving me nuts!

What does this macro RIDX do? Something like #define RIDX(i,j,dim) (i*dim+j)? This might be a source of your inefficiency.
I don't even know why I haven't looked at that- would probably help me a bet, eh?

Your attempt at optimization doesn't work because you haven't unrolled the loop correctly. Your second set of loops (the for (; j < dim; j++) loop) will never execute because j is already at dim.
Ok, so how would I unroll it correctly? Is my first unroll correct? If it is, how to I properly set up the clean-up code. The resources I have on hand are useful, but they don't properly address the existence of two for loops, just one, so I'm a bit perplexed.

Thanks for the response, I appreciate!

D H
Staff Emeritus
The way I understand the lab it seems like we should just be re-writing naive_rotate, but I didn't think of looking at the rest of the program.
Ahhh - so this is a homework assignment then. What specifically are you supposed to do in this assignment?

What's the best way to look at what's going the slowest? Compiling it and looking at the assembly seems like it would be the best option, yes?
If you were specifically asked to rewrite naive_rotate to make it faster, then that is what you need to do. In general, the best thing to do is to compile and run your program with a profiler. The profiler will tell you where the hogs are.

I don't even know why I haven't looked at that- would probably help me a bet, eh?
You bet. I assume you know how to do pointer arithmetic. Use that knowledge.

Ok, so how would I unroll it correctly? Is my first unroll correct?
I only see one attempt, and no, that is not correct for a number of reasons. You have an unrolling factor of 4. Walk through your code with dim equal to 3, 4, 5, and 6 (four separate calls to the function). You overrun the array with dim=3, and you don't set row/column 4 (starting from 0) with dim=5, rows/columns 4 and 5 with dim=6. You will need to do something special to prevent an overrun and to handle the parts you did not set in the unrolled loop.

Ahhh - so this is a homework assignment then. What specifically are you supposed to do in this assignment?
Optimize naive_rotate and naive_smooth although I only posted rotate because I just want to get through this one first.

If you were specifically asked to rewrite naive_rotate to make it faster, then that is what you need to do. In general, the best thing to do is to compile and run your program with a profiler. The profiler will tell you where the hogs are.
No idea what a profiler is at all...

I only see one attempt, and no, that is not correct for a number of reasons. You have an unrolling factor of 4. Walk through your code with dim equal to 3, 4, 5, and 6 (four separate calls to the function). You overrun the array with dim=3, and you don't set row/column 4 (starting from 0) with dim=5, rows/columns 4 and 5 with dim=6. You will need to do something special to prevent an overrun and to handle the parts you did not set in the unrolled loop.
Yeah, I meant my first loop, my bad. dim, as I said, is only equal to multiples of 32 so dim is never equal to anything less than 32 so I am not overrunning the array. That's where I'm confused: why is my unroll not working correctly? I can't see any problems with it- though i am new (like within the past few days) to loop unrolling so I'm probably not seeing something important.

D H
Staff Emeritus
Pretend the dimension is 4 (it doesn't really matter than 4 is not a multiple of 32). Walk through your code and the original by hand. Are they doing the same thing?

I have to run some errands; I won't be back until quite a bit later this evening. I have asked other homework helpers to dive in, but that may or may not happen.

Pretend the dimension is 4 (it doesn't really matter than 4 is not a multiple of 32). Walk through your code and the original by hand. Are they doing the same thing?

I have to run some errands; I won't be back until quite a bit later this evening. I have asked other homework helpers to dive in, but that may or may not happen.
Will do. And this isn't due for two weeks so time is not of the essence here- I just want to work on it when I have some free time. Thanks for everything so far!

Ok, well I got the unrolling to work (I should not have been touching j since I wasn't incrementing it by 4) but the best I could get it to do was when I incremented i by 16 and my performance was exactly the same as the naive_rotate code. I'm thinking I need to have a few separate loops that take different sets of i and j (like I store rows 0-3 and columns 0-3 at the same time as storing rows 0-3 and columns 8-11) but I 1) don't know how to do that and 2) need to figure out the block sizes in the cache to utilize it properly.

Any suggestions on how to approach this, or do you not understand what I'm talking about (I barely do!)?

D H
Staff Emeritus