
Sorting Algorithm Performance - Interview-Based Problem

  1. Mar 6, 2012 #1
    Hi guys,

    My professor recently asked a challenging question - he says it was asked in an interview, and he wanted to pose it as homework.

    Attached is the graph of the sorting algorithm's performance. The interviewer asks: "Tell me everything you can about this algorithm just by looking at the data on the plot." Obviously the idea is to characterize the performance and nature of the algorithm. It's a sorting algorithm operating on strings. The x-axis is the number of elements in the input and the y-axis is the tick count to complete the sort. That's the universe of information.

    Does it even sound reasonable? It's the first time I've seen such a question in algorithm analysis. Usually the questions are of the form "given this code, characterize the complexity of this algorithm", but this one is sort of the reverse. I believe the algorithm is not standard, but it is an interview-based problem, so I'm assuming it's reasonable to try hard at it.

    Thanks,

    Monte
     

    Attached Files:

  3. Mar 6, 2012 #2

    berkeman

    Staff: Mentor

    What are your thoughts so far? Looking at the graph, a number of things come to mind about the input data and the algorithm...
     
  4. Mar 6, 2012 #3
    I think it may be an O(n lg(n)) algorithm, but I can't explain the downward trend observed at the end. It definitely doesn't look like it's linear in the input size.
     
  5. Mar 6, 2012 #4
    Also, I've attached the time series.
     

    Attached Files:

  6. Mar 6, 2012 #5

    berkeman

    Staff: Mentor

    The interviewer apparently doesn't give you any information about the input strings. What are your thoughts on options for the input strings? What effects could that have on the plot?
     
  7. Mar 6, 2012 #6
    Well, he didn't say anything in addition, but I guess the idea was to gauge the candidate's performance based on the questions s/he asks the interviewer.

    Maybe the extent to which the input is sorted is correlated with input size, i.e. larger input size -> more sorted -> less time to sort? Maybe that's the variable responsible for the apparent speed-up on larger inputs.

    What are your thoughts? What about complexity?
     
  8. Mar 6, 2012 #7

    berkeman

    Staff: Mentor

    Like I said, I have a number of thoughts that come to mind in looking at the graph. But after all, *you* are the one being "interviewed"... :smile:

    EDIT -- Try "talking out loud", as if you were actually in an interview. Talk about the possibilities for the input data, and how that could create some of the patterns.... Talk about the algorithm, and some possibilities for why the performance improves for the larger data sizes... Ask some questions that would help you to figure more of it out...
     
  9. Mar 6, 2012 #8
    The performance doesn't seem to be linear in the input size, that's for sure. The speed-up is not abrupt, as would be expected if there were a conditional (say, if an element to be sorted is longer than some cutoff value, it gets dropped from the input), so that doesn't seem like an option. I think it has to do with how the data is pre-sorted. It seems like larger inputs are already mostly sorted, so less sorting needs to be done on them and the running time is smaller. Also, I think the professor pointed out that the answer should come from close examination of the data.

    You sound like you see through the problem, so I'm all ears.
     
  10. Mar 6, 2012 #9

    berkeman

    Staff: Mentor

    Some good thoughts there. How do you think the graph would look different if the input string data were completely random?

    And on the question about the improvement in sorting for the larger string sizes, what reasons could you think of for this?

    Any thoughts on what the strings look like, based on the graph? What order were the strings given to the algorithm?
     
  11. Mar 6, 2012 #10
    What I think is that there is variation in how sorted the array is, depending on the array size. I believe that around 570 elements the input (i.e. the array to be sorted) becomes somewhat unsorted, corresponding to the rise in the time to sort, but then that levels off because the inputs become more sorted.

    If the inputs were uniformly unsorted all the time (i.e. any particular string in the input could occur in any position with equal probability), I think we would see a classical n-lg-n shape without the observed decline. I believe the divergence between that scenario and the observed data begins at around input size 570 or so.
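
    To make that hypothesis testable, here's the rough harness I have in mind (my own C++ sketch, not the professor's code; the string generator and lengths are made up): produce uniformly unsorted inputs and time a standard O(n log n) sort at the same sample points as the plot.

    Code (Text):

    #include <algorithm>
    #include <cstdio>
    #include <cstdlib>
    #include <ctime>
    #include <string>
    #include <vector>

    // Build n random strings of fixed length len ("uniformly unsorted" input).
    std::vector<std::string> make_random_strings(int n, int len) {
        std::vector<std::string> v(n);
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < len; ++j)
                v[i] += char('a' + rand() % 26);
        return v;
    }

    int main() {
        srand((unsigned)time(0));
        // Same sample points as the plot: 10, 50, 90, ..., 2010.
        for (int n = 10; n <= 2010; n += 40) {
            std::vector<std::string> v = make_random_strings(n, 100);
            clock_t t0 = clock();
            std::sort(v.begin(), v.end());   // O(n log n) comparison sort
            printf("%d\t%ld\n", n, (long)(clock() - t0));
        }
        return 0;
    }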

    You mention larger string sizes. Do you think the strings have variable length? If so, what makes you suppose that? Is it something in the data?
     
  12. Mar 6, 2012 #11

    rcgldr

    Homework Helper

    The title of the graph is "clock ticks vs input size". If element size decreased with the number of elements, then the title of the graph would be misleading. The title implies that the average or actual element size is near constant.

    There's no reason for a larger amount of data to tend to be more pre-sorted or "algorithm favorable" than a smaller amount of data, unless there's something special about the data, but that's not the question being asked.

    I'm not aware of any sort algorithm that becomes faster with more data. Take the simplest case of a natural merge sort with presorted data: it does n-1 compares and declares the data already sorted, so that case is O(n). However, the graph falls off with a larger number of elements, so it's less than O(n). If the data were fitted with a polynomial equation, you'd have negative coefficients for the higher terms (x^3, x^4).
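
    For illustration, the n-1 compare pre-check I'm describing looks like this (a minimal sketch):

    Code (Text):

    #include <string>
    #include <vector>

    // The pre-check a natural merge sort effectively performs on
    // already-sorted input: one pass, adjacent compares only.
    bool already_sorted(const std::vector<std::string>& v) {
        for (size_t i = 1; i < v.size(); ++i)   // n-1 adjacent comparisons
            if (v[i] < v[i - 1])
                return false;                   // found a descending pair
        return true;                            // best case: O(n) and done
    }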
     
    Last edited: Mar 6, 2012
  13. Mar 6, 2012 #12
    Ok, but what are your conclusions? What explains the observed dynamic? Are you analyzing the data in Excel or just looking at the chart? I recall the professor saying that it's better to look at the data in Excel, but I don't know why. You can easily paste the data from the time series into Excel.

    Why would the chart be misleading if the element size were variable? I'm not challenging you; I just need to know so I can explain this to the professor. Just so we're on the same page, 'input size' means the count of elements in the array to be sorted. I don't see how the title of the chart implies near-constant or average element size. Maybe it's something assumed in the comp-sci literature? I haven't seen it so far; maybe it's assumed in the textbooks, but I don't know.

    The question is: "Tell me everything you can about this algorithm just by looking at the data on the plot." There is no statement about the input data other than the above: just the time series, the fact that the algorithm is a sorting algorithm, and that it operates on strings. Berkeman seems to have good ideas.
     
  14. Mar 6, 2012 #13

    berkeman

    Staff: Mentor

    Remember that this is your assignment, so we won't be offering any conclusions. We can ask questions and give hints, but the bulk of the work needs to be yours.

    I agree that it seems implied that the element size is a constant, and only the number of them is varying.

    I also agree that if the input strings were completely random, I don't think there would be some of the characteristics that you see on the graph.

    Would it make any difference, maybe, if the experiment were run from left to right? As opposed to starting with the larger string runs first? What are your thoughts on that?
     
  15. Mar 6, 2012 #14

    rcgldr

    Homework Helper

    Initially I was just looking at the data; I've since done a quick curve fit using a program I wrote. Excel could be used to do a similar curve fit. I'm not sure this helps much, other than to confirm that the number of clock ticks tends to decrease once the number of elements gets beyond a certain size.
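
    For reference, a fit like that doesn't take much code. Here's a minimal sketch (not my actual program, just the same idea): it fits y ~ a + b*x + c*x^2 by least squares, solving the 3x3 normal equations with Cramer's rule, and the sign of the fitted c tells you whether the curve bends downward.

    Code (Text):

    #include <vector>

    // Determinant of a 3x3 matrix.
    static double det3(double m[3][3]) {
        return m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
             - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
             + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]);
    }

    // Least-squares fit of y ~ a + b*x + c*x^2 via the normal equations,
    // solved by Cramer's rule: coef[0] = a, coef[1] = b, coef[2] = c.
    void fit_quadratic(const std::vector<double>& x,
                       const std::vector<double>& y, double coef[3]) {
        double s[5] = {0, 0, 0, 0, 0};   // s[k] = sum of x_i^k
        double t[3] = {0, 0, 0};         // t[k] = sum of y_i * x_i^k
        for (size_t i = 0; i < x.size(); ++i) {
            double p = 1.0;
            for (int k = 0; k < 5; ++k) {
                s[k] += p;
                if (k < 3) t[k] += y[i] * p;
                p *= x[i];
            }
        }
        double M[3][3] = {{s[0], s[1], s[2]},
                          {s[1], s[2], s[3]},
                          {s[2], s[3], s[4]}};
        double D = det3(M);
        for (int col = 0; col < 3; ++col) {   // Cramer: replace column col with t
            double Mc[3][3];
            for (int r = 0; r < 3; ++r)
                for (int c = 0; c < 3; ++c)
                    Mc[r][c] = (c == col) ? t[r] : M[r][c];
            coef[col] = det3(Mc) / D;
        }
    }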

    Variable element size would be OK as long as the average element size doesn't vary with the number of elements. Element size decreasing with the number of elements would be a conflict: "input size" implies an amount of data (like megabytes). If the element size decreased with the number of elements, the graph title should be "clock ticks vs number of elements".

    Generally, with a larger number of elements you'll find more elements where the leading characters of the field being compared are the same, requiring more time per compare, but the chart is showing the opposite trend.

    For a sort program that operates on strings, ignoring the sort algorithm itself: what type of data structures would you use in a program to sort strings, which may be variable-length?
     
  16. Mar 6, 2012 #15
    You can be sure that the x-axis is the number of elements in the array, so yes, the chart should be renamed accordingly: 'clock ticks' vs 'count of input elements'. This was very apparent from the discussion in class.

    We're making good progress so far. The word 'complexity' still rings in my head. I agree the ideas should be mostly mine. I think it's n-log-n, but I don't know if I'm looking at the data in the right way. You mentioned a curve fit in Excel. Anything you can say on the basis of that, maybe?
     
  17. Mar 6, 2012 #16

    rcgldr

    Homework Helper

    If it were n log n, the curve would continue to get steeper as the number of elements increased, but instead it levels off and starts diminishing. There's more going on here than just a sort algorithm. Somehow the data itself is affecting the results, but the only information provided is that as the amount of data increases beyond some point, the time to sort that data decreases, which wouldn't occur in a typical situation.
     
  18. Mar 7, 2012 #17
    Hi guys,

    I just got the answer to the problem: it turns out the algorithm is a plain bubble sort on variable-length strings. The length of the strings decreases linearly as the element count increases, so for small input sizes we've got long strings, and for large element counts we've got short strings. That's enough to explain the observed dynamic.
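
    For concreteness, here's a sketch of the kind of setup that would produce this (not the professor's actual code; the length formula and the random string contents are illustrative): a plain bubble sort timed on inputs whose string length falls linearly as the element count rises.

    Code (Text):

    #include <cstdio>
    #include <cstdlib>
    #include <ctime>
    #include <string>
    #include <vector>

    // Plain bubble sort; the copy-based swap costs O(string length).
    void bubble_sort(std::vector<std::string>& v) {
        for (size_t i = 0; i + 1 < v.size(); ++i)
            for (size_t j = 0; j + 1 < v.size() - i; ++j)
                if (v[j + 1] < v[j]) {
                    std::string tmp = v[j];   // three O(len) copies per swap
                    v[j] = v[j + 1];
                    v[j + 1] = tmp;
                }
    }

    int main() {
        srand((unsigned)time(0));
        for (int n = 10; n <= 1990; n += 40) {
            int len = 20000 - 10 * n;   // illustrative: length falls as count rises
            std::vector<std::string> v(n);
            for (int i = 0; i < n; ++i)
                for (int j = 0; j < len; ++j)
                    v[i] += char('a' + rand() % 26);
            clock_t t0 = clock();
            bubble_sort(v);
            printf("%d\t%ld\n", n, (long)(clock() - t0));
        }
        return 0;
    }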

    Somebody in the class proposed that the thing explaining the graph could equally have been the sortedness of the inputs, and that the quadratic trend observed at the beginning could have been quicksort's worst case. But the prof said it's very rare to observe quicksort operating in quadratic fashion. He showed how to fit a curve to the first few data points (there are some outliers to obfuscate things), but the point is clear: a quadratic fit is the right one.

    Anyway, I've never seen any problem like this. If that's what's being asked in interviews, then I need lots of luck. Any ideas where to find similar problems?

    How about this - maybe somebody on this forum can create such problems for people to exercise with?
     
  19. Mar 7, 2012 #18

    rcgldr

    Homework Helper

    I already brought up the issue that the graph was titled clock ticks versus input size, as opposed to clock ticks versus number of elements (where element size decreased with the number of elements), which is what I suspected. What was the actual title used on the graph for this problem?

    If the data size is fixed and element size decreases linearly with the number of elements, with only the number of elements changing, then the amount of data would be huge, as it would have to be the least common multiple of every number 10, 50, 90, ..., 1970, 2010 (= 10(1 + 4m), with 0 <= m <= 50), since the graph's x-axis includes all those sample points (over 10^40, by my guesstimate).

    Ignoring that issue with all the sample points (what is the total data size when there are 1970 elements versus 2010 elements?), even if element size did decrease linearly with the number of elements, I question whether the outcome would match the results shown on the graph. Say there's 1 megabyte of data, so it may be sorting 10 strings of 100,000 bytes each or 2000 strings of 500 bytes each. The number of operations increases as n^2 while the time for a swap only decreases as 1/n. The compare time would depend on how many characters had to be compared before a difference is found, and it's probable that the compare time isn't significantly affected by element size, because any pair of strings would likely differ well before the 500th character. So even in this case, the number of clock ticks versus number of elements isn't going to fall off the way it does in the graph.

    Typically when sorting strings, an array of pointers to the strings is used, with only the pointers being moved or swapped and the strings touched only for compares, except for the final output (one set of moves for the strings). In this case, the number of clock ticks versus number of elements sorted would not reach a peak and fall off as it does in the graph.
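
    A minimal sketch of that pointer-array scheme:

    Code (Text):

    #include <algorithm>
    #include <string>
    #include <vector>

    // Compare the pointed-to strings, not the pointer values.
    static bool less_by_value(const std::string* a, const std::string* b) {
        return *a < *b;
    }

    // Sort via an array of pointers: only pointers are swapped; the strings
    // are read for compares and moved once at the end for the final output.
    void sort_string_pointers(std::vector<std::string>& data) {
        std::vector<const std::string*> p(data.size());
        for (size_t i = 0; i < data.size(); ++i)
            p[i] = &data[i];
        std::sort(p.begin(), p.end(), less_by_value);
        std::vector<std::string> out(data.size());
        for (size_t i = 0; i < p.size(); ++i)   // one set of moves for output
            out[i] = *p[i];
        data.swap(out);
    }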

    There are other sort algorithms with overhead similar to bubble sort, such as comb sort or gnome sort. Wikipedia's comparison of sorting algorithms:

    https://en.wikipedia.org/wiki/Sorting_algorithm

    So, as mentioned before, if that graph is to be believed, then it was the data affecting the outcome, not the sort algorithm. Is there an actual set of data that, when sorted by bubble sort, produces the graph shown (using clock ticks as a relative number)?

    If the data is allowed to be manipulated, then a different set of data could be chosen for each case such that the worst case occurs for any sort algorithm with a worst case of O(n^2), even if its average case is O(n log n), and sets of data chosen where the worst case doesn't occur with larger numbers of elements.
     
    Last edited: Mar 7, 2012
  20. Mar 7, 2012 #19
    Hi rcgldr,

    Yes, I checked with the professor: the graph title shouldn't have included the word "size", though he did say the x-axis itself is labeled correctly, and that ordinarily, when presented with an algorithm performance chart, the size of the input means the count of elements, not bytes.

    I got the source code which generated the data (attached), and yes, it does look like there is a linear decrease in element size as the element count increases. The code is C++ and compiles easily in Code::Blocks (I checked). One thing I noticed is the string comparison that was used, though on second thought I don't think it really matters: it's linear in the string length, and I don't see how it could be any better. Probably C++'s own comparison operator is faster, but I guess the point is that string comparison can't be constant-time due to variable length. Their string-comparison routine does seem to take only strings of equal length, which is consistent with the way it's used.
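
    For reference, their comparison is essentially of this shape (a paraphrased sketch, not the attached source itself): a character-by-character loop over equal-length strings, so it's O(length) in the worst case.

    Code (Text):

    #include <string>

    // Returns <0, 0, or >0; assumes a.size() == b.size(), as in the source.
    int compare_equal_length(const std::string& a, const std::string& b) {
        for (size_t i = 0; i < a.size(); ++i)    // O(len) worst case
            if (a[i] != b[i])
                return a[i] < b[i] ? -1 : 1;     // first differing character decides
        return 0;                                // identical strings
    }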

    I liked the data-analytics orientation of this problem. Again, I've never seen anything like this in a textbook on algorithm analysis; mostly the exercises ask you to infer performance from the code. Any more such problems would be greatly appreciated.
     

    Attached Files:

  21. Mar 7, 2012 #20

    rcgldr

    Homework Helper

    Now I see the problem with that code example: the amount of data being sorted depends strongly on the number of elements. Looking at testsort(), the string length is set to 20000 - 10*n, so with n strings the amount of data being sorted is:

    amount_of_data = n * (20000 - 10*n) = -10*n^2 + 20000*n

    So the amount of data versus n is a parabola with a peak at n == 1000. Some sample points:

    Code (Text):

       n     amount of data sorted

    0010 | 00199000
    0050 | 00975000
    0090 | 01719000
    0100 | 01900000
    ...
    0970 | 09991000
    1000 | 10000000
    1010 | 09999000
    1050 | 09975000
    ...
    1890 | 02079000
    1930 | 01351000
    1970 | 00591000
    1990 | 00199000
     
    To fix this, change testsort() so that string length = 10000000 / n.
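
    In other words, something like this (a sketch; the actual testsort() is in the attached source, and the fill pattern here is just a stand-in):

    Code (Text):

    #include <cstdlib>
    #include <string>
    #include <vector>

    // Keep total data constant: n strings of length 10000000 / n,
    // so n * len is ~10^7 characters at every sample point.
    std::vector<std::string> make_input(int n) {
        int len = 10000000 / n;
        std::vector<std::string> v(n);
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < len; ++j)
                v[i] += char('a' + rand() % 26);   // stand-in for the original fill
        return v;
    }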
     