Note - I updated s0.cpp to use the C++ string's swap() member function to swap strings (a quick sketch of the idea follows the links below). This reduced the overall time, but the nearly linear curve remained about the same.
http://rcgldr.net/misc/sortb.zip
I left the old version in
http://rcgldr.net/misc/sortb0.zip if you still want that one.
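For reference, a minimal sketch of what the change amounts to (this is not the actual s0.cpp code, just the idea): swap() exchanges the strings' internal buffers in constant time instead of copying the character data through a temporary.

Code:
#include <string>

// sketch only - swap two elements by exchanging the strings' internal
// buffers; no allocation or character copying takes place
void swap_elements(std::string &a, std::string &b)
{
    a.swap(b);      // constant time, unlike tmp = a; a = b; b = tmp;
}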
Monte_Carlo said:
I just looked at s0.cpp and it seems like you've made the size of each string non-linear with the input element count. I mean, if the input element count is n and the total size is c, then you calculate the length of each string as l = c / n, so it's like a hyperbola - a non-linear relationship.
True, but the point of doing that is so the total size stays the same: c ~= n x l ~= constant. If instead you used l = 20000 - 10 n, then the total amount of data sorted would be 199,000 bytes for both n = 10 and n = 1990, while it would be 10,000,000 bytes for n = 1000. I could set the size of each element to be some obscure function of the number of elements to confuse the results no matter what sort algorithm was used, but that doesn't seem very useful, and it doesn't match the title of the graph, number of clock ticks versus input size.
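To make that concrete, here is a sketch of the sizing scheme I'm describing (the names and the example total are mine, not taken from s0.cpp): each element gets length l = c / n, so n elements of c / n bytes each is roughly c bytes total, regardless of n.

Code:
#include <cstddef>

// sketch only - per-element string length so the total data size stays ~constant
// c is an arbitrary example total, not the value used in the test program
std::size_t element_length(std::size_t n, std::size_t c = 10000000)
{
    return c / n;   // n elements x (c / n) bytes each ~= c bytes total
}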
To summarize my analysis of the graph in your first post: the algorithm can't be determined, and it appears that the total amount of data varies with the number of elements, since the graph doesn't follow what would be expected when the total amount of data is fixed or grows linearly with the number of elements to be sorted. If the amount of data varies as some unknown function of the number of elements, how can the algorithm be determined?
Monte_Carlo said:
Since you've got knowledge of performance profiling, do you know of a way to look at this potential quirk with the cache? How would you approach this problem if you were asked to investigate the way this program interacts with the cache levels? I don't know of any API which would do that, maybe you can help. I'll be wrapping my head around s1.cpp and s2.cpp in the meantime.
There are tools to do this, but I haven't used profiling tools in over a decade. I don't know if the cache can be disabled to see how it is affecting the results. There's also the issue of CAS latency effects with RAM:
http://en.wikipedia.org/wiki/CAS_latency
To get around these issues, rather than benchmark by CPU time, you could add counters for the number of comparisons, number of swaps, number of bytes compared, and so on. This gives an alternate way to compare algorithms theoretically, while the CPU time gives a more real-world comparison.
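A minimal sketch of what I mean by counters (the functor name and fields are mine, not part of the test programs): wrap the comparison so each call tallies the number of comparisons and a rough byte count, independent of CPU time.

Code:
#include <cstddef>
#include <string>

// sketch only - counting comparator to measure work done instead of CPU time
struct CountingLess
{
    std::size_t compares = 0;   // number of comparisons performed
    std::size_t bytes = 0;      // rough number of bytes examined

    bool operator()(const std::string &a, const std::string &b)
    {
        compares++;
        bytes += (a.size() < b.size()) ? a.size() : b.size();
        return a < b;
    }
};

Usage sketch: pass an instance by reference, e.g. std::sort(v.begin(), v.end(), std::ref(cmp)), so the counts survive the sort, then report cmp.compares and cmp.bytes.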
The profilers I used were mostly for embedded software (software in devices). A fairly fast sample rate would store an array of program counter addresses into memory on a PC while the embedded software ran in the device. The link map was then used to generate a histogram of the number of samples per function, sorted to see which functions were consuming the most CPU time. Profilers for a PC work in a similar fashion. An alternative approach, which isn't real time and is very slow, is to simply trace (breakpoint) on every instruction and store a pseudo time stamp in a huge array of data.
Monte_Carlo said:
The high performance timer I've never heard of!
The various timers aren't well documented in Visual Studio or on MSDN unless you already have some idea of the name of the timer you want. There are waitable timers and multimedia timers, as well as the high resolution counter.
Use QueryPerformanceFrequency() to get the frequency for QueryPerformanceCounter(). This is query only; you can't trigger events or functions based on this counter.
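A minimal usage sketch, assuming you just want elapsed seconds around the code being timed (sort_data() is a placeholder, not one of the test program's functions):

Code:
#include <windows.h>
#include <stdio.h>

void time_it(void)
{
    LARGE_INTEGER freq, start, stop;
    QueryPerformanceFrequency(&freq);   // counts per second
    QueryPerformanceCounter(&start);
    // sort_data();                     // code being measured goes here
    QueryPerformanceCounter(&stop);
    double seconds = (double)(stop.QuadPart - start.QuadPart) / (double)freq.QuadPart;
    printf("elapsed: %f seconds\n", seconds);
}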
Waitable and high resolution timers:
msdn timers.htm
Multimedia timers:
msdn multimedia timers.htm
timeBeginPeriod():
msdn timeBeginPeriod.htm
Note that timeBeginPeriod() affects both waitable and multimedia timer resolution. On my 3 systems, timeBeginPeriod(1) results in a timer frequency of 1024 Hz, which the multimedia timers fake as 1000 Hz by adding an extra delay on the 42nd, 84th, and 125th ticks, so that (125+3) ticks x (1000/1024) ms = 125 ms, repeating this pattern every 125 ms. The sequence timeBeginPeriod(1), SetWaitableTimer(,,1,,,); results in a 512 Hz timer on those same 3 systems, but without faking 500 Hz with extra delays. It should result in a 1024 Hz timer, but it ends up at 512 Hz.
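For reference, a sketch of the sequence I'm describing, assuming a plain periodic waitable timer with a 1 ms period and no error checking (link with winmm.lib for timeBeginPeriod/timeEndPeriod):

Code:
#include <windows.h>
#include <mmsystem.h>                   // timeBeginPeriod / timeEndPeriod
#pragma comment(lib, "winmm.lib")

void waitable_timer_demo(void)
{
    timeBeginPeriod(1);                                     // request 1 ms timer resolution
    HANDLE hTimer = CreateWaitableTimer(NULL, FALSE, NULL);
    LARGE_INTEGER due;
    due.QuadPart = -10000LL;                                // first fire in 1 ms (100 ns units, negative = relative)
    SetWaitableTimer(hTimer, &due, 1, NULL, NULL, FALSE);   // then every 1 ms
    for (int i = 0; i < 1000; i++)
        WaitForSingleObject(hTimer, INFINITE);              // wait for each tick
    CloseHandle(hTimer);
    timeEndPeriod(1);                                       // restore previous resolution
}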