Speed up Sorts with 2 New Techniques: Sort-Lenses & Prefab-Sorts

In summary, two techniques for speeding up sorts are discussed. One uses a sort-lens, an integer array that indexes the source data so only indexes move during the sort; the other uses a prefab-sort, which collapses the smallest partitions of a quick sort into sorted order with a precomputed in-line sort.
  • #1
basePARTICLE
A few new techniques exist to speed up sorts. Here are two.

(1) When sorting many large strings, it is better to use a sort-lens. A sort-lens is an integer array that indexes the source data.

string data[n]
int sort_lens[n]

The data is referenced by data[sort_lens[COUNTER]].
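
A minimal C++ sketch of the idea (illustrative names, not code from any of the packages mentioned in this thread): the comparison looks at the strings, but the sort itself only moves integers.

Code:
#include <algorithm>
#include <string>
#include <vector>

// Comparator that looks through the lens at the referenced strings,
// not at the integer indexes themselves.
struct LensLess
{
    const std::vector<std::string> *data;
    bool operator()(int a, int b) const { return (*data)[a] < (*data)[b]; }
};

// Build and sort a "lens", an index array into data; only ints are moved.
std::vector<int> make_sort_lens(const std::vector<std::string> &data)
{
    std::vector<int> lens(data.size());
    for (int i = 0; i < (int)data.size(); ++i)
        lens[i] = i;
    LensLess cmp = { &data };
    std::sort(lens.begin(), lens.end(), cmp);
    return lens;                       // data[lens[i]] is the i-th smallest string
}

Walking data[lens[i]] for i = 0..n-1 then visits the strings in sorted order while the source array keeps its original sequence.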

(2) When you implement your quick sort and the number of elements left to sort is five or less, which makes the number manageable, you can use a prefab-sort. Here is the solution for three elements. Given three elements {c,b,a}, where sorted means [a,b,c], here is the static transformation: state A gives an unsorted sequence and solution B gives the exchange sequence that takes A to sorted-A.
state [a,b,c] solution [1,2,3]
state [a,c,b] solution [1,3,2]
state [b,a,c] solution [2,1,3]
state [b,c,a] solution [2,3,1]
state [c,a,b] solution [3,1,2]
state [c,b,a] solution [3,2,1]
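
One way to read the table in code (a hedged sketch, not the actual prefab implementation): three comparisons identify which of the six states is present, and the swaps they trigger are the exchange sequence for that state, so the elements collapse straight into sorted order.

Code:
#include <algorithm>   // std::swap

// Prefab-sort of three elements: a fixed, branch-only transformation,
// no loops and no recursion.
template <typename T>
void prefab_sort3(T &a, T &b, T &c)
{
    if (b < a) std::swap(a, b);   // order the first pair
    if (c < b) std::swap(b, c);   // the largest of the three is now in c
    if (b < a) std::swap(a, b);   // order the remaining pair: a <= b <= c
}

For the state [c,b,a], for example, all three comparisons fire and the three swaps reverse the sequence into [a,b,c].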

For fewer than 50 elements, according to Knuth, in-line sorts are faster. So this technique is based on both the quick sort and the in-line sort. I have implemented up to a four-element in-line sort, for both integers and strings, and I will add the five-element in-line sort as soon as I get the time.
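
And a rough sketch of how this plugs into a quick sort (illustrative code, not the qs() object itself; the cutoff of four and the insertion-style fallback are assumptions): partitions of four or fewer elements skip the recursion and are collapsed by a small in-line sort.

Code:
#include <algorithm>

// Quicksort that collapses tiny partitions with an in-line sort instead of recursing.
template <typename T>
void quick_sort_small_cutoff(T *a, int lo, int hi)       // sorts a[lo..hi] inclusive
{
    while (hi - lo + 1 > 4)                               // partitions of <= 4 fall through
    {
        T pivot = a[lo + (hi - lo) / 2];
        int i = lo, j = hi;
        while (i <= j)                                    // Hoare-style partition
        {
            while (a[i] < pivot) ++i;
            while (pivot < a[j]) --j;
            if (i <= j) { std::swap(a[i], a[j]); ++i; --j; }
        }
        quick_sort_small_cutoff(a, lo, j);                // recurse on one side...
        lo = i;                                           // ...loop on the other
    }
    for (int k = lo + 1; k <= hi; ++k)                    // in-line sort of <= 4 elements
        for (int m = k; m > lo && a[m] < a[m - 1]; --m)
            std::swap(a[m], a[m - 1]);
}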

So here is a great addition for your qs() objects (your quick sort object).
 
  • #2
Sort algorithms (plus links to zips of source code and related performance stats) were discussed in this thread.

efficient sorting

The zips with my final versions of the sort code are here:

efficient sorting - links to zip of source files

These include Microsoft's standard template sort() (quicksort) and stable_sort() (merge sort) routines, as well as my own merge sort.

Sort a vector of 8 byte elements (Visual Studio, 32 or 64 bit mode):
sortv.zip

Sort a text file (Visual Studio, 32 or 64 bit mode) (also includes test text files):
sortp.zip

For sorts of data that fit in memory, the time to sort is very small (the sortv.zip sorts took 95ms to 180ms for 1 million 8 byte elements), and the algorithm isn't that important. For sorts of larger amounts of data that require usage of external media, such as disk (or tape), then a merge sort is the fastest algorithm. Merge sort is also quickest if there's significant pre-ordering of data before sorting.
 
Last edited:
  • #3
Merge sort: a quick search and then a linear insertion, right? Most of the work in this type of sort is moving data. A sort-lens is ideal for this type of sorting because the sort times would be approximately constant, since only integer indexes are moved.

p.s. Depending on your project, the original sequence of the data can be as important as what you find after sorting, so sort-lenses are useful in certain kinds of mundane applications.

p.p.s. I saw goto's and thought it was BASIC!
 
Last edited:
  • #4
basePARTICLE said:
Merge sort: a quick search and then a linear insertion, right?
As noted in the original thread, in Microsoft's stable_sort(), an insertion sort is used to create groups of 32 elements, before switching to the merge sort algorithm. I'm not sure if 32 is an optimal number for random data. In my msort1(), increasing and decreasing sequences are used to create groups of 2 or more elements (variable sized groups), taking advantage of any significant pre-ordering of data. msort0() just creates groups of 2 records and merges from there. Both msort0() and msort1() are faster than stable_sort(), but I don't know if it's because of the template overhead or because it's a bad idea to use insertion sort to create groups of 32 elements.

Most of the work in this type of sort is moving data. A sort-lens is ideal for this type of sorting because the sort times would be approximately constant, since only integer indexes are moved.
That depends on whether the indexes are the same size as the data. In the case of the vector sorts (sortv.zip), 8 byte elements were being sorted, and in 64 bit mode the indexes are the same size as the data, so you might as well move the data.

In the text file sorts (sortp.zip), pointers to elements are moved and the data is simply compared until the sort is done, and only then is data moved.

In the case of disk based sorts, it's always better to sort the data, because the time it takes to random access and transfer a single element is about the same as the time it takes to random access and transfer many elements, and in the case of a merge sort, up to 1/4 of available memory for data can be used for each transfer of a group of elements. With 4 devices, hard drives or tape drives, all I/O is sequential.

p.p.s. I saw goto's and thought it was BASIC!
The goto's are only used to go to a common exit point if there's an error (like insufficient memory or a bad file name) at the start of the program. Personally, I see virtually no difference between using multiple goto's or multiple returns in a program, except that with older debuggers it was easier to use goto's and set a single break point at the common exit point.

I have no qualms about using goto's rather than burdening code with widely separated pairs of braces on nested if statements for error handling, or the classic alternative of using a status variable (if(Status == GOOD) then {...}) to chain a series of steps where each step depends on the success of the previous step. With the status-variable pattern there is no way to know if or where common code after such a sequence occurs in a function without scanning for the first chunk of code not prefixed by if(Status == GOOD), and searching for the "first occurrence of something not there" is not programmer friendly. With a goto to handle the failure case of each step in a chain of steps, a programmer is searching for an existing label rather than the absence of an if statement.

If there's going to be a discussion about goto's in general, start another thread, and I will be happy to post there about goto's. I'm hoping this thread remains related to sorting algorithms.
 
Last edited:
  • #5
Sorry if I'm a little rusty, but how does someone know if the data has any significant pre-ordering, so use can be made of a particular sort object? In my time, O(rder) = N*log(N) was an indication of sorting prowess.

Here is a link that describes sorting results between the merge sort and the quick sort, differently from how you did: http://www.atkinson.yorku.ca/~sychen/ITEC2620/Sorting/nlognSorting.html . This is a Java demonstration. You can basically trust the results, I believe. I worked at that university twenty years ago (not a real reason, but...).

My claim is I will linearly speed up that qs() with my in-line sorting of fewer than five elements, at the average cost of one if computation. The reason is simple. At the smaller sorting intervals, instead of quick sorting four elements, which would involve at least another three quick sorts, the four elements are collapsed into their sorted order. The metaphysical base for the sort is no longer one, but four, so I may claim N*log(N)/4, which makes a difference for very large N. However, in theory N/4 is still N.
 
Last edited by a moderator:
  • #6
basePARTICLE said:
Here is a link that describes sorting results between the merge sort and the quick sort, differently from how you did: http://www.atkinson.yorku.ca/~sychen/ITEC2620/Sorting/nlognSorting.html . This is a Java demonstration. You can basically trust the results, I believe. I worked at that university twenty years ago (not a real reason, but...).
The merge sort demo at that website uses the same inefficient algorithm that wikipedia and so many other web sites describe as a "merge sort", one of my pet peeves. It's a "top down", "divide and conquer" algorithm that is significantly different from a "classic" merge sort, which is now called a "bottom up" merge sort. I've added comments about this to the wikipedia article in the discussion section. You have to do a web search for "bottom up merge sort" to easily find references to the classic version of a merge sort.

Here is a link to a website that includes an animation of a "bottom up" merge sort as well as the other types of sorts.

http://www.csc.depauw.edu/~bhoward/courses/0304Fall/csc222/sort

A "bottom up" merge sort normally consists of two phases. A makegroups phase where some algorithim is used to create the initial groups of elements, and then a mergegroup phase where the initial groups are repeatedly merged until there is only a single group of elements. A "natural" "bottom up" merge sort looks for sequential sequences of ascending or decending elements during the makegroups phase, decending elements are handled by reversing the pointers or indexes to elements, or decending groups are handled during the first merge pass, especially if sorting data directly. A simple makegroups function could create groups of 2 elements, swapping elements (or pointers) if not ascending. Simpler still is no makegroups phase, and just starting of with groups of 1 element each. The overhead for a "natural" merge sort is that the variable group sizes require an additional array of counts for the number of elements in each group, but note that the size of this array of counts is cut in half on each merge sort pass, so it's only a significant overhead on the early merge passes when there are a relatively large number of small groups.

I have a "natural" makegroups as an option in the text sort routines. If the define for VARSZGRP is set to 1, the "natural" version of makegroups is used, and if it's set to 0, then the create groups of 2 elements version of makegroups is used.

A top down "merge sort" first recursively divides data into smaller and smaller groups until it gets down to groups of 2 or less, or to less than some fixed size where it uses some other sort algorithm, before it actually begins a true merge sort.

In the animated case from the link above, the "bottom up" merge starts off with group size = 1 (no makegroups phase). The "benchmark" is based on the number of operations, and based on its "operations" count, quicksort is much faster than bottom up merge sort on random data, but my actual testing shows the cpu usage to be close when sorting random data (the sort-the-data-directly merge sort was about 15% slower than quicksort in 64 bit mode, with less difference in 32 bit mode), but the merge sort was faster when using pointers or indexes.

The main reason "top down" merge sort is shown on so many web sites is that it's an in-place sort algorithm, similar to other "divide and conquer" sort algorithms used. A "bottom up" merge sort requires double the amount of memory for the list of pointers or indexes to elements, or worse yet, double the data space if directly sorting elements. Perhaps the "bottom up" merge sort is considered "unfair" since it uses "double the memory". In addition, a "natural" "bottom up" merge sort will require an additional array to keep track of the element count for all the groups created during the makegroups phase, since the group sizes will be variable based on the amount of pre-ordering in a file.

One of the major shortcomings of a "top down" merge sort is that it can't directly take advantage of any natural ordering. In addition, the recursive nature adds overhead with all the indexes that get pushed and popped on and off the stack. It's almost doing double the work of a normal "bottom up" merge sort.

The classic, now called "bottom up" merge sort was a tape sort using 4 tape drives, using all sequential I/O. (3 is the actual minimum but it requires double the number of passes). One reason for using tape drives is that with multiple heads reading and writing in parallel, they could transfer streaming data faster than disk drives. The same is true today, but the fast tape drives cost $5500 or more, so it's not worth it when compared to a raid of hard drives, since hard drives are so relatively cheap these days.

If you look at the benchmark data from the thread I linked to before, the trade off is that quicksort does more compares, but moves less data. This is helpful if the sort is actually sorting data, but in the case of pointer or index based sorting, the comparison overhead is generally higher than the movement of pointers, at least in the case where the element size and number of elements are large, and elements are similar enough that comparisons of pairs of elements have to compare a significant amount of data before detecting a difference.

Sorry if I'm a little rusty, but how does someone know if the data has any significant pre-ordering, so use can be made of a particular sort object? In my time, O(rder) = N*log(N) was an indication of sorting prowess.
A "natural" merge sort, auto detects this, during the makegroups phase. If the data is already sorted, it does N-1 compares and moves no data, O = N-1 in this case.

My claim is I will linearly speed up that qs() with my in-line sorting of fewer than five elements, at the average cost of one if computation.
I'm not sure about this, since it wasn't done in the quick sort that is part of the standard template library (at least the one from Visual Studio). If you have Visual Studio (note that Visual Studio Express is a free downloadable tool from Microsoft), you can look at the code in the template include file.

...\Microsoft Visual Studio 8\VC\include\algorithm

sort() is a quicksort, and doesn't change its method based on node size unless it decides there are too many nodes (bad mid-value guess), and switches to a heap sort for the problem nodes.

stable_sort() is a standard bottom up merge sort, and does use an insertion sort to create groups of 32 elements in its makegroups phase before beginning the actual sort. I didn't bother, but since "algorithm" is a source file, the 32 is an easily changed value.

Since this is a standard template "library", and based on the comments at the end of the source file, it appears that these are generic C++ standard algorithms as opposed to Microsoft specific algorithms, although Microsoft may have made some tweaks to them. I don't know if "algorithms" is included as part of the "standard template library" for non-Microsoft C++ compilers.

(After a bazillion edits to this, I think I'm finally done with this post).
 
Last edited by a moderator:
  • #7
I like your style!

Here is a report I found on your peeve at http://www.cs.grinnell.edu/~rebelsky/Courses/CS152/2004F/Outlines/outline.46.html , where the technique is called An Iterative Merge Sort.

I admire the information-gathering you have added to the merge sort. It is surprising how much information can be obtained from simple comparisons.

My favourite sort looks like this:
In one pass of n-1 comparisons, counting all the elements less than the element of interest, and marking them at the same time, places the element of interest in its position; then the elements less than it and those greater than or equal to it are sorted in turn.
Groups of four or less are rearranged using the in-line sort. I should test its speed! I will download that package you mentioned, but it will take some time...
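
Here is one hedged reading of that description in C++ (illustrative names; the temporary buffers stand in for the "marking"):

Code:
#include <algorithm>
#include <cstddef>
#include <vector>

// One pass of n-1 compares counts the elements smaller than the first element
// (the "element of interest"), which fixes its final slot, while the remaining
// elements are split into a "less" group and a "greater or equal" group.
// Groups of four or fewer are finished with an in-line (insertion style) sort.
template <typename T>
void count_place_sort(std::vector<T> &a, std::size_t lo, std::size_t hi)  // sorts a[lo..hi)
{
    std::size_t n = hi - lo;
    if (n <= 4)
    {
        for (std::size_t k = lo + 1; k < hi; ++k)
            for (std::size_t m = k; m > lo && a[m] < a[m - 1]; --m)
                std::swap(a[m], a[m - 1]);
        return;
    }
    T pivot = a[lo];
    std::vector<T> less, rest;                    // the buffers stand in for the "marking"
    for (std::size_t i = lo + 1; i < hi; ++i)
        (a[i] < pivot ? less : rest).push_back(a[i]);
    std::copy(less.begin(), less.end(), a.begin() + lo);
    a[lo + less.size()] = pivot;                  // pivot lands in its final position
    std::copy(rest.begin(), rest.end(), a.begin() + lo + less.size() + 1);
    count_place_sort(a, lo, lo + less.size());
    count_place_sort(a, lo + less.size() + 1, hi);
}

It is essentially a quick sort partition in which the pivot's final index is known from the count before anything is moved.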
 
Last edited by a moderator:
  • #8
Here is a description of a 4 tape sort, which is a standard "bottom up" merge sort:
http://www.colchsfc.ac.uk/computing/computingnotes/sortsearch/ssfourtape.htm

In the case of tape drives, you don't need any group counts, since single file marks could be used to denote the end of groups, and a double file mark used to denote the end of data on each tape.

If there were 6 tapes drives, then groups could be merged 3 at a time for log3(N) passes. 8 tape drives would be log4(N) passes.

In addition, clever things like reading and writing tapes in the reverse direction and merging descending sequences during reverse passes eliminate rewind time. Plus, if the sort program "knows" it's the last pass, the final merge pass goes back to a disk based file.

In the case of a memory sort, there's not much to be gained by merging more than 2 groups at a time, because most of the overhead is in the comparisons; data isn't being moved, just pointers or indexes, and the number of comparisons ends up being about the same or maybe worse if more than two groups are merged at a time, partly because when the end of one group is reached, the remaining element pointers in the other group are simply copied with a memory move, so a smaller number of larger groups is better than a larger number of smaller groups. Similarly, for a 1 to 3 disk drive sort, the overhead of random accessing also takes away any advantage of trying to merge more than 2 groups at a time (at least I think this holds for up to 3 disk drives; I'm tired and haven't fully thought this out).

For tape based sorts, there's the clever "polyphase" sort:
http://www.colchsfc.ac.uk/computing/computingnotes/sortsearch/sspolyphase.htm

There is a typo in the pictures, the 59 becomes a 57 during the sort, and then reverts back to a 59 when done.

Here the difference is that merging is done 3 groups at a time instead of the typical 2 at a time. Plus there's some cleverness in writing a different number of records to each tape in the initial pass, with the intent of leaving just enough data on the "longer" input tapes after the end of a merge pass (determined by the tape with the fewest records) to continue the merge process on the next pass. At the end of a merge pass, the input tape with the fewest records has reached the end of its data, and that tape becomes the output tape on the next pass. Note that the number of records per group doesn't double with each pass, since new groups from the previous output tape are merged with the groups left on the tapes from previous passes.

Note, I'm not that old, 56 now, but I started out learning how to program in high school (rare for the late 1960's), and part of my learning process allowed me to see some historical (for the late 1960's) stuff in action. This stuff dates back to pre 1960 and machines like the IBM 1401, which makes me wonder why so much time has been spent "reinventing" sort algorithms.
 
Last edited:
  • #9
Note, I'm not that old, 56 now <snip>
which makes me wonder why so much time has been spent "reinventing" sort algorithms.

I'm more than ten years older than that.

Answer: Try to afford a Syncsort license for your company. There's gold in them thar sorts! Seriously, the folks who get wound up in this stuff nowadays have not bothered to read Knuth. So they don't know about basic properties of sorts or tag sorting or whatever. That's the real answer you want - they haven't seen the stuff. Or you could say: they are out of sorts :smile:

Decent sorting software has at its disposal a variety of algorithms, finds the best algorithm to use, and also uses decomposition sorting - changing algorithms in midstream on reduced or decomposed lists. All of this requires a lot of up-front knowledge, sort of a poor man's heuristics for choosing sort algorithms.

Part of the upfront knowledge is information about system resources:
whether you need tempfiles and if you have resources to allocate them
how much memory you can allocate

The other part comes from analyzing the datastream as you build the keys, tags, etc. I've seen too many routines that will try to sort already correctly ordered data, for example.

Most of this upfront work is ignored by posts on the subject of sorting. Everyone goes off the deep end on tweaks to algorithms.

YMMV.
 
  • #10
Jeff Reid said:
Here is a description of a 4 tape sort, which is a <snip>
Note, I'm not that old, 56 now, but I started out learning how to program in high school (rare for the late 1960's), and part of my learning process allowed me to see some historical (for the late 1960's) stuff in action. This stuff dates back to pre 1960 and machines like the IBM 1401, which makes me wonder why so much time has been spent "reinventing" sort algorithms.
Here is one reason. Suppose a chip were able to compare N elements in a vector V, using an NxN matrix, each element against every other, in one clock cycle, and keep the results! The next cycle computes, for each M[i,i], how many elements are larger, smaller, and equal. I would think that in the following cycle V would be sorted. This can be done in approximately three clock cycles. OK, add a few more clock cycles for comfort.
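
A software analogue of that idea (a plain rank sort sketch, not hardware): all N x N comparisons are independent of each other, which is what would let a chip do them in parallel; here they are just two nested loops.

Code:
#include <vector>

// Rank sort: every pair is compared, each element's rank is the number of
// smaller elements plus the number of equal elements that come before it,
// and one final pass scatters everything into a sorted copy.
std::vector<int> rank_sort(const std::vector<int> &v)
{
    int n = (int)v.size();
    std::vector<int> out(n);
    for (int i = 0; i < n; ++i)
    {
        int rank = 0;
        for (int j = 0; j < n; ++j)
            if (v[j] < v[i] || (v[j] == v[i] && j < i))   // ties broken by position
                ++rank;
        out[rank] = v[i];
    }
    return out;
}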

I think sorts are reinvented because some people consider NlogN, or Nlog2N, something to beat.

p.s. sorry people, I have been working with random data for the last twenty years...
 
Last edited:
  • #11
why so much time has been spent "reinventing" sort algorithms.
What I meant by this is the "rediscovery" of already known algorithms, not the true invention of new algorithms. However, even on the internet, it's difficult to find good stuff on sort algorithms, because it's mostly a science of the 60's and 70's, predating the internet, so only a portion of this knowledge ended up on the internet as opposed to books sold for profit.

Personally, I wrote my first disk based merge sort program back in 1974, the other programmers were amazed at the relative speed, mostly because other methods involved a lot of random accessing of single records as opposed to the multi-record pseudo-sequential I/O that works so well with merge sorts. Even an index based sort that took zero time would be slower than merging data because you'd typically get only 1 record per random access via an index. The overhead for random access versus sequential access is like 30 to 1 or higher, so unless a merge sort of the actual data (as opposed to indexes) takes 30 passes, it's faster.

A lot of the stuff used in commercial high end sorts, like CASort or SyncSort is proprietary, because a lot of this stuff was done before software/algorithms could be copyrighted.

The later stuff is copyrighted, but early software patents were too easily granted. One of the sorting patents that I found by accident when looking for data compression patents (I was working with a company trying to find the compression algorithm with the lowest "royalty" cost == "fewest" or "cheapest" number of patents to buy) did something I thought was extremely obvious, at least for any computer with memory mapping. The algorithm simply took advantage of the fact that a single I/O to an external device can be composed of multiple descriptors for data scattered throughout real memory, avoiding movement of data, but this is done on any memory mapped system, since the physical location of data in mapped memory will be scattered. The only cleverness is using the existing hardware descriptors on such machines to transfer data a record at a time, rather than by page (or partial page for the first and last pages of a transfer). I'm not sure what the limit on transfer size is on a mainframe, but even older PC SCSI adapters had at least 17 descriptors to support up to 64k transfers, since the Intel memory map size was normally set to 4096 bytes (by Windows; the other Intel option is 4MB, not too useful). The 17th descriptor was needed since a data transfer would be unlikely to start on a page boundary. Other SCSI controllers had 255 descriptors (a byte counter limitation, otherwise why not 257?), and I haven't found the limit for native IDE / PATA / SATA transfers, other than it's at least 2MB on most PC motherboards.

Suppose a chip were able to compare N elements in a vector V, using an NxN matrix, each element against every other, in one clock cycle, and keep the results!
The issue with hardware that compares multiple elements is that in order to do the compares, data has to be read from scattered locations in memory, and I'm not sure how this could be optimized. A "bottom up" merge sort does lend itself to parallel processing though, especially during the makegroups and early mergegroups phases, where many small groups of elements are being created or merged independently of each other.

I think sorts are reinvented because some people consider NlogN, or Nlog2N, something to beat
In the previous thread, radix or "bucket" (non-comparative) sorts were discussed, either as standalone sorts or, in my case, during the makegroups phase of a merge sort. Here, the more uniform the distribution of the data, the more a radix sort helps. For example, the most significant byte could be used to separate the elements to be sorted into 256 separate arrays. Then the makegroups pass would simply grab the first element from each of the 256 arrays to create 256 element groups, with some special handling for the smaller groups created as each of the 256 arrays is emptied, but note that the "natural" "bottom up" sort already deals with variable sized groups, so the mergegroups phase of the code wouldn't require any changes.

Another consideration discussed but not investigated, was exactly how the memory cache algorithms of various CPUs would affect various optimizations, such as the makegroups based on radix sorting into multiple arrays.

Again, note that it took between .1 and .2 seconds to sort 1 million 8 byte records on a PC, depending on the algorithm used. So most sorting programs are more of an intellectual exercise than a time saver. Still, I find it interesting, since it's something I did a long time ago, and virtually stopped doing when I started working on embedded software (multi-tasking firmware) back in 1987.
 
Last edited:
  • #12
For example, the most significant byte could be used to separate the elements to be sorted into 256 separate arrays.
If anyone is interested (besides me), I could code this up as a #define option, add it to my source in the zip files and compare performance.

One thing that has changed since the old days is that 2GB of ram on a desktop computer isn't that rare these days, so sucking up memory to speed up sorting is more of an option than it used to be.

Hoping Hurkyl might throw in a comment here.
 
Last edited:
  • #13
Jeff Reid said:
If anyone is interested (besides me), I could code this up as a #define option, add it to my source in the zip files and compare performance.

One thing that has changed since the old days is that 2GB of ram on a desktop computer isn't that rare these days, so sucking up memory to speed up sorting is more of an option than it used to be.

Hoping Hurkyl might throw in a comment here.
I can relate a "since the old days" story related to sorting speeds. The year was 1988, and the chip in use was Intel's 80386. There was a memory-management card called the ALL-card, which sat between the motherboard and the 386 and re-interpreted the memory addresses. To cut the story in half: when it was my turn to implement the XMS driver for the global market, the senior engineer wanted me to keep the memory blocks in sorted order, believing the whole process would be faster. I disagreed, noting that a macro instruction ffs (find first set bit) on a bit string of tags would uncover free blocks of memory faster than keeping the memory blocks in sorted order using linked lists or a sorting mechanism.

The point of this story is that, in certain cases, unless the data is properly analysed, programmers will run off and start sorting, looking for speed increases where there are actually none. Needless to say, by 1989 that XMS driver held the position as the fastest, most robust XMS driver of its time.

Jeff Reid, go ahead. Remember that for some, there are always considerations to be taken into account when readying oneself for sorting, so the most current techniques that correspond to real situations are always positive packets of information for assimilation.

Upgrade it...
 
Last edited:
  • #14
Jeff Reid said:
I don't know if "algorithms" is included as part of the "standard template library" for non-Microsoft C++ compilers.
The C++ standard only specifies observed behavior and complexity constraints. For example, I believe for sort(...), all the standard says is that it should sort the data, be O(N²) worst-case, and O(N log N) average case. (e.g. both quicksort and merge sort are permissible implementations)

(complexity is probably measured in comparisons and element copies. Standard probably also says things like there should be no side-effects)


Jeff Reid said:
Hoping Hurkyl might throw in a comment here.
I don't really have much to add at the moment.
 
Last edited:
  • #15
Hurkyl said:
I don't really have much to add at the moment.
Your comment about the C++ standard helped. I wasn't sure if standard template "library" implied the exact same source, or just functional equivalence. Now I realize it's functional equivalence, so thanks.
 
  • #16
I finally made a hybrid radix + merge vector sort. The radix sort first splits up the data into multiple bins based on the leading bits of elements, then each bin is merge sorted, and then the sorted bins are moved to form the sorted vector. Note that any sort could be used to sort the individual bins, the point here was to show the improvement of sucking up memory to utilize a radix method to split up the work.
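
A much-condensed C++ sketch of that structure (not the code in sortv.zip; std::sort stands in for the per-bin merge sort, and a 32 bin split by the top 5 bits is assumed):

Code:
#include <algorithm>
#include <vector>

typedef unsigned long long u64;

// Hybrid radix + sort: distribute by leading bits, sort each bin on its own,
// then concatenate the sorted bins back into the vector in bin order.
void hybrid_radix_sort(std::vector<u64> &v)
{
    const int BIN_BITS = 5;                              // 2^5 = 32 bins
    const int BIN_COUNT = 1 << BIN_BITS;
    std::vector<std::vector<u64> > bin(BIN_COUNT);
    for (size_t i = 0; i < v.size(); ++i)
        bin[v[i] >> (64 - BIN_BITS)].push_back(v[i]);    // distribute by leading bits
    size_t k = 0;
    for (int b = 0; b < BIN_COUNT; ++b)
    {
        std::sort(bin[b].begin(), bin[b].end());         // any per-bin sort will do
        for (size_t i = 0; i < bin[b].size(); ++i)
            v[k++] = bin[b][i];                          // sorted bins back in bin order
    }
}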

These programs sort a (standard template library) vector of 4,194,304 pseudo-random 64-bit elements (unsigned integers). This is 32MB (33,554,432 bytes) of data.

The hybrid sort uses 32 bins (this is easily changed, just 2 defines), each of which could end up holding all the data, so that's 32 bins x 32MB = 1GB of data, plus some overhead, so it probably needs 2GB of ram on a PC to run, or a reduction in bin count (16 bins would require 512MB plus overhead). It's piggy, but it's fast.

I didn't create one, but a "card sort" type of radix sort could be used if the element size is reasonably small, eliminating any comparisons, but it requires 2 times the number of bins, or twice the number of data moves. Each pass uses the current least significant bits of the elements to separate them into a set of bins, then those bins are processed in order, using the next least significant bits of the elements, to separate them into the other set of bins. When done, the elements in the last set of bins are moved into the output vector. The alternative is to use a single set of bins and move to the output vector on each pass.
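
For reference, a sketch of the single-set-of-bins variant of that card sort idea (illustrative code, assuming unsigned 64 bit elements): eight passes, least significant byte first, and no comparisons at all.

Code:
#include <vector>

typedef unsigned long long u64;

// LSD "card sort" radix: each pass distributes into 256 bins by the current
// byte and copies the bins back in order; bin order preserves stability.
void lsd_radix_sort(std::vector<u64> &v)
{
    std::vector<std::vector<u64> > bin(256);
    for (int pass = 0; pass < 8; ++pass)                 // byte 0 (LSB) up to byte 7 (MSB)
    {
        for (size_t i = 0; i < v.size(); ++i)
            bin[(v[i] >> (8 * pass)) & 0xFF].push_back(v[i]);
        size_t k = 0;
        for (int b = 0; b < 256; ++b)
        {
            for (size_t i = 0; i < bin[b].size(); ++i)
                v[k++] = bin[b][i];
            bin[b].clear();
        }
    }
}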

Results:

Code:
Time to sort 4,194,304 64 bit elements of pseudo random data:

hybrid (radix+merge) sort  hsort.cpp     352 ms
microsoft sort             sort.cpp      461 ms
merge sort                 msort.cpp     539 ms
microsoft stable_sort      sort.cpp      554 ms

64 bit mode, Windows XP64, Visual Studio 2005 compiler.
Intel core 2 X6800 CPU, 2GB 4-4-4-10 DDR2 ram, Intel D975XBX motherboard.

Link to zip of source files, compatible with Visual Studio Express (free from Microsoft), in either 32 bit or 64 bit mode:

sortv.zip
 
Last edited:
  • #17
I added the hybrid radix + merge sort to group of text file (pointer based) sort programs. I also updated the test file, src.txt. It is now based on a binary image of pi, using the bottom 7 bits of each byte plus 0x20, so each record is 10 bytes long: 8 pseudo random data bytes, range 0x20 to 0x9f, followed by the 2 byte sequence, 0x0d 0x0a (cr lf).

This time the microsoft stable_sort(), a merge sort, was faster than sort(), a quicksort (quicksort had more compares).

The hybrid sort uses 161 bins, 160 bins for pointers to records that begin with a byte in the range 0x00 to 0x9f, and 1 bin for pointers to any records that begin with a byte greater than 0x9f (zero of these in my test file). Update - The test source file's data doesn't have any values below 0x20, so this reduces the number of actual bins used to 128. In 64 bit mode, you need to run it on machine with 2GB of ram (or reduce the number of bins). In 32 bit mode, it may run on a machine with 1GB of ram.

Code:
Time to sort 1048576 pointers to
10 byte text records of pseudo random data (8 bytes data, cr, lf):

32 bit mode is quicker because 32 bit pointers are being sorted as opposed to
64 bit pointers (1/2 the amount of data moved, compares are the same).


64 bit mode, Windows XP64, Visual Studio 2005 compiler.
Intel core 2 X6800 CPU, 2GB 4-4-4-10 DDR2 ram, Intel D975XBX motherboard.

hybrid (radix+merge) sort  hsortp.cpp     296 ms
merge sort                 msortp.cpp     454 ms
microsoft stable_sort      sortp.cpp      578 ms
microsoft sort             sortp.cpp      704 ms


32 bit mode, Windows XP, Visual Studio 2005 compiler.
Intel core 2 X6800 CPU, 2GB 4-4-4-10 DDR2 ram, Intel D975XBX motherboard.

hybrid (radix+merge) sort  hsortp.cpp     265 ms
merge sort                 msortp.cpp     375 ms
microsoft stable_sort      sortp.cpp      437 ms
microsoft sort             sortp.cpp      609 ms

Link to zip of source files plus test file, compatible with Visual Studio Express (free from Microsoft), in either 32 bit or 64 bit mode:

sortp.zip
 
Last edited:
  • #18
More comments on the hybrid sort. The overhead for the hybrid sort is the movement of data into and out of the bins, equivalent to 2 merge passes. Using a merge sort on the bins is an issue because the number of passes inefficiently increases by 1 if the element count in a bin is slightly more than a power of 2. Under ideal circumstances, a 4 bin hybrid sort takes the same time as a standard fixed-group merge sort (as opposed to one that uses existing ascending and descending sequences to create variable groups during the initial makegroups phase). Under bad circumstances, the data may not separate into nearly even numbers of elements in each bin, and the hybrid radix / bin sort loses out to a standard sort. A hybrid sort using quicksort on the bins should be faster if the data is close to random and the compares aren't long.

The original question from the previous thread was which is faster, a quicksort or a merge sort, and the answer is that it depends on the data. Quicksort does more compares but moves less data, and depending on the data pattern it works well on near-random data. If the compare overhead becomes significant because more data has to be compared before finding a difference between elements, the quicksort is slower than a merge sort.

update - If a merge sort merges 4 groups at a time instead of 2, then the total number of compares increases by a factor of 1.5, but the amount of data moved decreases by a factor of 2. This would probably be faster than a quicksort, depending on cache hits, but slower than a 2 group merge sort if the compare overhead is 1.33 times that of moving the data or greater.
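
A rough count behind that factor of 1.5 (assuming a simple k-way merge that scans all k group heads, about k-1 compares per element emitted; a tournament or heap selection would lower the k-way figure):

Code:
\text{compares} \approx N\,(k-1)\,\log_k N = N\,(k-1)\,\frac{\log_2 N}{\log_2 k}

k = 2:\quad N \cdot 1 \cdot \log_2 N
k = 4:\quad N \cdot 3 \cdot \tfrac{1}{2}\log_2 N = 1.5\,N\log_2 N

\text{merge passes (element moves)}:\quad \log_2 N \;\rightarrow\; \log_4 N = \tfrac{1}{2}\log_2 N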

The fastest sort time occurs when there is significant pre-ordering of data, and a merge sort that scans for ascending and descending sequences of elements when creating the initial groups to be merged is very fast. In the case of a pre-sorted file, such a merge sort takes the ideal minimum time, n-1 compares and no moves.

I think I'm done with this sort stuff for now. Considering that these sorts took less than a second to sort 32 megabytes of data, there's not much practical difference. Businesses with huge files will be using purchased sort programs, rather than writing their own. Wondering if even Hurkyl is looking at this thread anymore.
 
Last edited:

What are Sort-Lenses and Prefab-Sorts?

Sort-lenses and prefab-sorts are the two techniques proposed in this thread for speeding up sorts. A sort-lens is an integer array that indexes the source data, so the sort moves only indexes rather than the (possibly large) records themselves. A prefab-sort replaces the smallest partitions of a quick sort (five elements or fewer) with a precomputed in-line sort that puts those elements directly into order.

How do Sort-Lenses and Prefab-Sorts improve sorting efficiency?

A sort-lens reduces the cost of each move, since only integer indexes are moved instead of whole records. A prefab-sort removes the deepest levels of quick-sort recursion by collapsing small groups directly into sorted order with a short, fixed sequence of comparisons and exchanges.

Can Sort-Lenses and Prefab-Sorts be used with any type of data?

Yes, Sort-Lenses and Prefab-Sorts can be used with any type of data as they are generic techniques that can be applied to any sorting algorithm. However, their efficiency may vary depending on the type of data being sorted.

Are Sort-Lenses and Prefab-Sorts widely used in the scientific community?

Both ideas have long-standing counterparts in the sorting literature: a sort-lens is essentially a tag (index or pointer) sort, and handing small partitions of a quick sort to a simple in-line sort is a well-known optimization. The discussion in this thread is mostly about when these techniques actually pay off, which depends on element size, data distribution, and whether the data fits in memory.

Are there any potential drawbacks to using Sort-Lenses and Prefab-Sorts?

One potential drawback of a sort-lens is the additional memory required for the index array, which may be a concern for large datasets; and if the indexes are the same size as the elements, there is little advantage over moving the data itself. A prefab-sort only removes work at the bottom of the recursion, so its benefit is a constant factor that varies with the characteristics of the data being sorted.
