
Efficient sorting

  1. Aug 24, 2007 #1

    rcgldr

    User Avatar
    Homework Helper

    Moderator note: This was split from https://www.physicsforums.com/showthread.php?p=1409789#post1409789

    Perhaps this response should go into its own thread regarding sorting ...

    yes.
    I think you mean sorts that use very little storage overhead. A merge sort typically doubles the required storage space.
    Radix sort is a type of merge sort (in the older terminology I'm using here). A good merge sort will handle variable-size groups, which are created in the initial "scan" by forming groups out of any increasing (or, optionally, decreasing) sequence of elements. Groups with decreasing sequences are "reversed" back into increasing sequences when written to external storage, or during the first actual merge pass. Each merge pass merges pairs of groups (or triplets for a radix 3 sort, quadruplets for a radix 4 sort, ...), cutting the number of groups to 1/2 (or 1/3, or 1/4, ...) and increasing the size of each group. The final pass is the one that produces a single group, usually writing the output in the desired format to external storage if required.
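    Here is a minimal sketch of the run-forming scan described above, in C++ on a plain vector of ints standing in for the records (the element type and the in-memory setting are assumptions for illustration):

    Code (Text):

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Scan the input once, recording the start offset of each natural run.
    // Strictly decreasing runs are reversed in place, so afterwards every
    // run [runs[i], runs[i+1]) is in increasing order.
    std::vector<std::size_t> findRuns(std::vector<int> &a)
    {
        std::vector<std::size_t> runs;
        std::size_t i = 0, n = a.size();
        while (i < n) {
            runs.push_back(i);
            std::size_t j = i + 1;
            if (j < n && a[j] < a[i]) {                 // decreasing run
                while (j < n && a[j] < a[j - 1]) ++j;
                std::reverse(a.begin() + i, a.begin() + j);
            } else {                                    // non-decreasing run
                while (j < n && a[j] >= a[j - 1]) ++j;
            }
            i = j;
        }
        runs.push_back(n);                              // end sentinel
        return runs;
    }

    Each merge pass would then merge adjacent pairs of these runs until only one run remains.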

    A merge sort is ideal for external storage, since multiple elements can be read or written with a single I/O. With 4 external storage devices, a radix 2 merge sort involves all sequential I/O, and can be done on tape drives (as was done on early systems). 3 is the bare minimum for a tape sort, but it doubles the number of passes as the output tape has to be re-read to split it back up to the other two tape drives for each pass.

    Optimizations for external storage include taking advantage of hardware that can handle scattered data with a single I/O operation, meaning that memory is randomly accessed during a single write operation (there's no point in doing this for reads). Despite the obviousness of this, since this is how any bus-mastering I/O operation with mapped memory has to work (the logically sequential mapped segments are physically scattered), some Japanese company managed to get a patent on it; I don't know if the patent is still valid.
     
    Last edited by a moderator: Aug 24, 2007
  3. Aug 24, 2007 #2

    CRGreathouse

    User Avatar
    Science Advisor
    Homework Helper

    It's several million to sort, if I do it that way, and the quicksort works just fine. Radix sort can be worse than mergesort and the rest when the numbers are sparse, as they are here -- I wouldn't even try it on this dataset.
     
  4. Aug 24, 2007 #3

    CRGreathouse

    User Avatar
    Science Advisor
    Homework Helper

    Jeff and Hurkyl:

    Merge sorts are pretty good, I'll grant that, but at the moment sorting is < 20% of the total runtime if I set up the main part of the program to use a binary search (sparse array, essentially) and 0% of the runtime if I use a bit array directly. A different thread to discuss this in general terms would be interesting, and I'd love to participate, but whatever benefit I would see from changing from quicksort on data that is almost optimal for quicksort to some kind of mergesort would be minimal, at best a 5% speedup, and could easily increase runtime instead.
     
  5. Aug 24, 2007 #4

    Hurkyl

    User Avatar
    Staff Emeritus
    Science Advisor
    Gold Member

    By comparison-based sort, I mean a sort routine that's based upon using < to determine the relative order of elements.

    Radix sort doesn't use <; it uses the leading digits of an element to determine in which (approximate) quantile it belongs, and then sorts the quantiles with some method.

    Radix (and quick) sort work in exactly the opposite fashion from merge sort; they split the data into (approximate) quantiles, then sort the quantiles by some method. Merge sort uses some method to sort segments of the data, and then merges the segments together.

    Radix sort is asymptotically faster than any comparison-based sort, assuming that you have data representable as "digits" and the leading ones are sufficiently close to uniformly distributed1. In the optimal setup, where you can use just one pass to split the data into base-case-sized quantiles, the running time is actually linear in the number of elements.



    e.g. if you had RAM for a little over 2^30 records, but needed to sort 2^40 records on disk, you could perform one pass through the data, using the leading 10 bits to split the data into 2^30 segments (or more!), then sort each segment in RAM. The time requirement is

    . Two passes through the data on disk
    . 2^10 base case sorts
    . One look at the leading digits of 2^40 elements

    Whereas if you used a merge sort, and merged 2^10 streams together at once, the running time is
    . Two passes through the data on disk
    . 2^10 base case sorts
    . 2^40 pushes and pops into a 2^10-long priority queue


    Your choice of how to sort 2^30 records in RAM probably affects the overall running time -- I'm sure I've seen actual experimental results that demonstrate that quick sort requires about half as many compares as merge sort, and takes about half as much time... and I'm pretty sure that radix sort beats quick sort for uniformly distributed1 data.


    1: A slight variation to the radix sort is able to handle data with other known distributions.
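    A minimal in-memory sketch of the splitting idea described above (bucket by the leading digits, then sort each bucket); the 8-bit digit, the 64-bit keys, and the use of std::sort as the base-case sort are assumptions for illustration:

    Code (Text):

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    // One MSD split on the top 8 bits, then a comparison sort per bucket.
    void radixSplitSort(std::vector<std::uint64_t> &data)
    {
        const int kBits = 8;                        // assumed digit width
        std::vector<std::vector<std::uint64_t>> bucket(1u << kBits);
        for (std::uint64_t x : data)
            bucket[x >> (64 - kBits)].push_back(x); // leading digit picks the quantile

        data.clear();
        for (auto &b : bucket) {                    // buckets are already in key order
            std::sort(b.begin(), b.end());          // base-case sort
            data.insert(data.end(), b.begin(), b.end());
        }
    }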
     
  6. Aug 24, 2007 #5

    mgb_phys

    User Avatar
    Science Advisor
    Homework Helper

    It's also worth remembering that big-O comparisons are only valid for large N.
    For small N, a simple algorithm can be faster than an asymptotically fast one because the setup time is less - in real life N is usually small.
     
  7. Aug 25, 2007 #6

    rcgldr

    User Avatar
    Homework Helper

    OK, I was thinking of a radix X sort as a base-X merge sort; it's old terminology for merge sort that apparently is no longer used. A radix 2 sort used 4 storage devices, a radix 3 sort used 6 devices, and a radix 4 sort used 8 devices.

    I don't get how 10 bits of data leads to splitting the data up into 2^30 segments. I would assume that 10 bits would let you split up the data into 2^10 segments.

    I'm not sure where the pushes and pops come from. With a merge sort all you need is a list of record counts, each count representing a group, an index into the list, and a total record count or group count to know when the end of the list is reached. A merge sort processes the list sequentially. It's the sequential nature of a merge sort that makes it so fast.

    A merge sort can take advantage of the ability to read or write many records with a single access to a storage device. If there are 4 devices, it's an entirely sequential I/O process.

    update - A radix sort could also read or write many records with a single I/O, but somewhere during the initial or second pass there will be more random accessing: either to write all the "bins" during the initial pass, or, if they are written sequentially, to read the scattered clusters of "bins" during the second pass.

    For example, an Intel host bus adapter on a PC for SATA or PATA (IDE) hard drives can transfer 2MB of data with a single read or write command (that's 512 + 1 memory map segment descriptors, each segment being 4096 bytes long). If the record size is 128 bytes, then that's 16,384 records per read or write.

    With the sort example you gave, the radix method will need 1024 "bins" that records have to be written to, based on the leading 10 bits. If the distribution is near random, then most of the time only a single record will be written per access.

    update - Start off with 1024 bins in memory, each holding 16,384 records and using 2MB, for a total of 2GB of memory. As each bin gets filled, it's written to one of 1024 files on the hard drive; alternatively, all bins are written sequentially and an array of bin numbers and counts is built, which will require random access across the disk to read the data for each "bin" on the second pass.

    update - Assuming the "bins" are written sequentially (which should be faster), gathering each bin will require the disk to scan across the entire written file (assuming an even distribution), with random accesses skipping over the rest of the data. That's 1024 passes over the written "bin" file to gather the bins. The merge sort will take only 10 sequential passes over the data to sort it, so the merge sort will be faster.

    Regarding comparison sorts, a quicksort takes up to C x N^2 operations, where C is some constant and N is the number of records, while a merge sort takes D x N x log2 N operations (or logX N, where X is the number of groups merged at a time), where D is probably a larger constant than C, but for reasonably large N the merge sort is going to be quicker.

    For a memory sort, if knowledge about the data allows it to be quickly split up into "bins" like a radix sort, then that will be faster than a merge sort. However, with hard drives the penalty of access time makes the radix sort slower than a merge sort.
     
    Last edited: Aug 25, 2007
  8. Aug 25, 2007 #7

    rcgldr

    User Avatar
    Homework Helper

    example merge sort algorithm

    128-byte records, 1GB of RAM used for record data, and the remainder of RAM used for program, variables, and buffering.

    Initial pass: Create "free pool" linked lists of 2MB buffers, one pool for input and two for output. Thread 1 gets a buffer from the input free pool, reads 16,384 records, and adds the buffer to thread 2's input queue. Thread 2 dequeues the input buffers, moves the data into the 1GB record buffer, returns the input buffers to the input free pool, and builds a list of pointers to each record, until 8,388,608 records have been read and the 1GB record buffer is full. The list of pointers is then sorted. Thread 2 then gets buffers from the "even" output free pool, uses the sorted pointer list to fill them with data, and adds them to thread 3's input queue; once the output group count is reached, it switches to thread 4's queue. If the environment supports scattered-data writes, then just a list of pointers is sent for each output group and no data is moved. Thread 2 finally adds the group count of 8,388,608 records to the group count list. Thread 3 gets an output buffer, writes 16,384 records to the "even" output file, and returns the buffer to its free pool. Thread 4 gets an output buffer, writes 16,384 records to the "odd" output file, and returns the buffer to its free pool.

    Merge pass: Create "free pool" linked lists of 2MB buffers, two pools for input and one for output. Thread 1 gets a buffer from the "even input" pool, reads 16,384 records from the "even" input file, and adds the buffer to the even input queue. Thread 2 gets a buffer from the "odd input" pool, reads 16,384 records from the "odd" input file, and adds the buffer to the odd input queue. Thread 3 takes one even and one odd buffer from the input queues and merges pairs of groups, using the list of group counts as it goes and creating a new group count list that is half the size of the input list, then sends the output buffers to thread 4. Thread 4 gets buffers and writes to the "even" output file until the output group count (double the input group count) is reached, then switches to writing to the "odd" output file. (The last group may be even or odd, depending on the total number of records to be sorted.)

    The final merge pass occurs when only two groups remain. It's the same as a normal merge pass, except that any record formatting is done by the output program and the output goes to a single file.
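    For concreteness, here is a much-simplified, single-threaded sketch of the core of one merge step: merging one pair of sorted runs from the two input files into an output file. The fixed 128-byte record size comes from the example above; the whole-record memcmp comparison and the plain fread/fwrite in place of the threaded 2MB buffering are assumptions for brevity:

    Code (Text):

    #include <cstddef>
    #include <cstdio>
    #include <cstring>

    const std::size_t kRecSize = 128;               // record size from the example

    // Merge cntA sorted records from fa with cntB sorted records from fb into fout.
    void mergeRuns(FILE *fa, FILE *fb, FILE *fout, std::size_t cntA, std::size_t cntB)
    {
        unsigned char ra[kRecSize], rb[kRecSize];
        bool haveA = cntA && fread(ra, kRecSize, 1, fa) == 1;
        bool haveB = cntB && fread(rb, kRecSize, 1, fb) == 1;
        while (haveA && haveB) {
            if (memcmp(ra, rb, kRecSize) <= 0) {    // stand-in for the real key compare
                fwrite(ra, kRecSize, 1, fout);
                haveA = --cntA && fread(ra, kRecSize, 1, fa) == 1;
            } else {
                fwrite(rb, kRecSize, 1, fout);
                haveB = --cntB && fread(rb, kRecSize, 1, fb) == 1;
            }
        }
        while (haveA) {                             // copy the tail of run A
            fwrite(ra, kRecSize, 1, fout);
            haveA = --cntA && fread(ra, kRecSize, 1, fa) == 1;
        }
        while (haveB) {                             // copy the tail of run B
            fwrite(rb, kRecSize, 1, fout);
            haveB = --cntB && fread(rb, kRecSize, 1, fb) == 1;
        }
    }

    A full merge pass would call this once per pair of groups, alternating the output between the "even" and "odd" files and halving the group count list, as described above.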
     
    Last edited: Aug 25, 2007
  9. Aug 25, 2007 #8

    Hurkyl

    User Avatar
    Staff Emeritus
    Science Advisor
    Gold Member

    The runtime of quick sort is C N log2 N on almost any dataset -- the counterexamples are few and far between. You will essentially never see a large dataset where median-of-3 quick sort performs quadratically unless you specifically crafted it to have that behavior. And you can get around this by selecting the partition element at random: then the initial ordering of the dataset is entirely irrelevant, so you get an expected C N log2 N running time on everything.

    If you are really worried about worst-case behavior, but want a deterministic comparison-based algorithm, then introsort is the winner: it normally uses quicksort, but if it detects pathological behavior, it switches over to one of the guaranteed N log N algorithms (heapsort, typically). In this way, it usually enjoys the extra speed of quick sort, but on bad data it very quickly switches over to the slower, but guaranteed, speed of the fallback algorithm.


    Merge sort is not D N logX N time when merging X streams at once -- you forget that when merging X streams, you continually have to decide which of your X things is smallest before writing, and that's an extra factor of log2 X (at the very least), so the running time is still (assuming the D doesn't change)
    D x N x logX N x log2 X = D x N x log2 N

    That's why I was talking about a priority queue -- if you're merging 1000 streams, you need a data structure to hold the 1000 elements under consideration, and each time you want to write a new element into the merged list, you have to determine which of those 1000 is smallest, remove it, and then insert its successor into the set of elements under consideration: i.e. you need a priority queue. A binary heap is the fastest structure I know for this task, and it requires something like log2 X time per operation.
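    A minimal sketch of that k-way merge, using std::priority_queue as the binary heap; merging in-memory vectors instead of disk streams is an assumption made for brevity:

    Code (Text):

    #include <cstddef>
    #include <functional>
    #include <queue>
    #include <utility>
    #include <vector>

    // Merge X already-sorted runs; each pop/push costs about log2(X) compares.
    std::vector<int> mergeK(const std::vector<std::vector<int>> &runs)
    {
        using Item = std::pair<int, std::size_t>;   // (value, run index)
        std::priority_queue<Item, std::vector<Item>, std::greater<Item>> pq;
        std::vector<std::size_t> pos(runs.size(), 0);

        for (std::size_t r = 0; r < runs.size(); ++r)
            if (!runs[r].empty())
                pq.push({runs[r][0], r});           // seed with each run's head

        std::vector<int> out;
        while (!pq.empty()) {
            auto [v, r] = pq.top();                 // smallest of the X candidates
            pq.pop();
            out.push_back(v);
            if (++pos[r] < runs[r].size())
                pq.push({runs[r][pos[r]], r});      // insert its successor
        }
        return out;
    }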



    I guess that you can do a radix sort either way: e.g. start by sorting on the least significant digit instead of the most significant, but I haven't thought too much about it. It seems to me that starting with the most significant digit gives the clearest algorithm.



    example radix sort

    Same problem parameters.

    Split passes
    (For pass k, assume that the top 9(k-1) bits of each record are fixed, and the next 9 bits are roughly uniformly distributed)

    Allocate 512 buckets of 2 megabytes each.
    Allocate a few small overflow buffers.
    Allocate an input buffer.

    Thread 1 reads 16k records at a time, using the 9 bits to split the data into buckets. Whenever a bucket fills, thread 2 writes that data to the corresponding file on disk. (And thread 1's writes to that bucket are stored in an overflow buffer until the write is complete)

    Recursively perform split passes until you have 0.5 GB-sized files.

    Now, each file fits into half of memory, so we iterate through them, sorting with our favorite out-of-place sorting method and appending them to our output file.
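    A stripped-down sketch of these split passes, recursing on successive 9-bit digits of 64-bit keys in memory; the in-memory buckets in place of 2MB disk files and the std::sort base case are assumptions for illustration:

    Code (Text):

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Recursively split on the next 9 bits (most significant first), falling
    // back to an ordinary sort once a segment is small enough to "fit in RAM".
    void msdSplitSort(std::vector<std::uint64_t> &v, int shift = 64 - 9,
                      std::size_t ramLimit = 1u << 20)
    {
        if (v.size() <= ramLimit || shift < 0) {    // base case: sort in place
            std::sort(v.begin(), v.end());
            return;
        }
        std::vector<std::vector<std::uint64_t>> bucket(512);
        for (std::uint64_t x : v)
            bucket[(x >> shift) & 511].push_back(x);   // next 9-bit digit

        v.clear();
        for (auto &b : bucket) {
            msdSplitSort(b, shift - 9, ramLimit);      // split pass k+1
            v.insert(v.end(), b.begin(), b.end());     // buckets come back in order
        }
    }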
     
    Last edited: Aug 25, 2007
  10. Aug 25, 2007 #9

    rcgldr

    User Avatar
    Homework Helper

    I corrected this to up to N^2. I've benchmarked quicksorts versus a merge sort, and the merge sort is normally faster. Quicksort sorts pointers in place, while merge sort uses the equivalent of a "semi-sparse" array for the pointers by using double the memory for pointers: this memory holds two lists, an input list and an output list, for each pass of the merge. If I have the time, I'll update my old 16-bit benchmark code and post links to it here.
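    A minimal sketch of that two-list arrangement: a bottom-up merge sort over an array of record pointers, with a second pointer array of equal size, the two lists swapping roles each pass (the const char * record type and strcmp comparison are assumptions for illustration):

    Code (Text):

    #include <algorithm>
    #include <cstddef>
    #include <cstring>
    #include <utility>
    #include <vector>

    // Merge sort on record pointers: the input list and output list trade
    // places each pass, which is the "factor of 2" extra pointer space.
    void mergeSortPtrs(std::vector<const char *> &recs)
    {
        std::vector<const char *> tmp(recs.size());
        std::vector<const char *> *src = &recs, *dst = &tmp;

        for (std::size_t width = 1; width < recs.size(); width *= 2) {
            for (std::size_t lo = 0; lo < src->size(); lo += 2 * width) {
                std::size_t mid = std::min(lo + width, src->size());
                std::size_t hi  = std::min(lo + 2 * width, src->size());
                std::size_t i = lo, j = mid, k = lo;
                while (i < mid && j < hi)
                    (*dst)[k++] = (std::strcmp((*src)[i], (*src)[j]) <= 0)
                                      ? (*src)[i++] : (*src)[j++];
                while (i < mid) (*dst)[k++] = (*src)[i++];   // copy remainder, no compares
                while (j < hi)  (*dst)[k++] = (*src)[j++];
            }
            std::swap(src, dst);                    // ping-pong the two lists
        }
        if (src != &recs) recs = *src;              // result may have ended up in tmp
    }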

    I use the "sparse factor" to compare sort times. A quick sort has a sparse factor of 1, it sorts a list of pointers in place. A merge sort has a sparse factor of 2, it uses twice the space (for the list of pointers). A radix X sort has a sparse factor of X, it uses X times amount of space (for pointers). It's logical to assume that the higher the sparse factor, the faster the sort. So a radix 3 or higher sort should be faster than a merge sort which should be faster than a quicksort. A radix 2 sort should be the same as a merge sort, since it just replaces one merge sort pass with bin split pass.

    True for a memory sort, but when using multiple storage devices, what counts is the number of passes made on the data. 4 storage devices require log2 M passes, while 8 devices require log4 M, where M is the number of groups created during the initial pass. 8 devices will be twice as fast; in your example, 4 devices will require 10 passes, but 8 devices would only require 5 passes.

    But a merge sort just merges two streams, outputting a single stream that alternates between two output files on each pass. In your example the merge sort will take 10 passes, but it's all sequential I/O, which disk drives are good at. Read-ahead and write-behind caching on hard drives will handle multiple sequential I/O requests with just a single random access hit. Assuming uniform distribution, the radix sort will rarely find a significant number of 2MB clusters that are adjacent and in order when randomly accessing the clusters. The radix sort is dealing with well over 512 files, and the random access overhead is significant. With a single hard drive, for each merge pass, the merge sort reads 1/2GB clusters from two files and writes 1GB clusters to two other files, with an overhead of 3 random access hits per 1GB of data merged. With 2 hard drives, a merge sort would have one input and one output file on each hard drive. The input files would be read in parallel, and the output files would be written alternately, with an overhead of 2 random access hits per 1GB of data merged. With 4 hard drives, a merge sort is all sequential I/O, one random access hit per pass (in your example that's just 10 random access hits).
     
    Last edited: Aug 25, 2007
  11. Aug 25, 2007 #10

    Hurkyl

    User Avatar
    Staff Emeritus
    Science Advisor
    Gold Member

    Okay, I could believe that the random access could be slow -- I was trying to minimize the number of 2MB transfers. (Of course, I could always just split into the same number of buckets as things I can write at a time; I think that would be the same number of sequential disk accesses, but less CPU time)

    I know I've compared the quick, merge, and heap sorts for my CS classes before, with quick sort being the winner by a factor of 2 and heap sort slightly ahead of merge. I don't remember the parameters, though, nor did I know as much as I do now about memory optimization. (And, of course, this was about 9 years ago too)
     
    Last edited: Aug 26, 2007
  12. Aug 25, 2007 #11

    rcgldr

    User Avatar
    Homework Helper

    Link to zip of source, programs, and test source file:

    sort.zip

    The test source file is just a text file with 262,144 records; each record contains 8 ASCII digits of "e" followed by CR/LF, which should give a random, uniform distribution.

    To test the programs, extract the zip to a temp directory, open a DOS console window, and enter the commands:

    qsort src.txt dstq.txt

    msort0 src.txt dstm.txt

    On my system, the fastest times I see are 93.75 milliseconds for qsort and 62.5 milliseconds for msort0. Sometimes it takes an additional 10 ms; I'm not sure what's causing this.
     
    Last edited: Aug 25, 2007
  13. Aug 25, 2007 #12

    Hurkyl

    User Avatar
    Staff Emeritus
    Science Advisor
    Gold Member

    Well, I imagine the fact that you used the C library quick sort is the reason your merge sort was faster: qsort must be entirely generic, which means the record size cannot be known at compile time and the comparator must always be invoked through the function pointer. On the other hand, your merge sort knows it is manipulating an array of pointers, and it inlines the invocation of CmprRecs.

    If you want to use library routines rather than rolling your own, the C++ versions in the STL (sort and stable_sort, which in the GNU implementation invoke introsort and an iterative merge sort, I think) are templates, and can benefit from compiler optimization.
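    A small sketch of the difference being described: the same ordering passed to C's qsort through a function pointer versus given to std::sort as a lambda the compiler can inline (sorting plain ints here is an assumption for illustration):

    Code (Text):

    #include <algorithm>
    #include <cstdlib>
    #include <vector>

    // qsort sees only opaque bytes and a function pointer...
    int cmpInt(const void *a, const void *b)
    {
        int x = *static_cast<const int *>(a);
        int y = *static_cast<const int *>(b);
        return (x > y) - (x < y);
    }

    void sortBothWays(std::vector<int> v)
    {
        std::vector<int> w = v;

        // ...so every comparison is an indirect call, with no inlining.
        std::qsort(v.data(), v.size(), sizeof(int), cmpInt);

        // std::sort is a template: the element type and the comparison are
        // known at compile time, so the compare can be inlined.
        std::sort(w.begin(), w.end(), [](int a, int b) { return a < b; });
    }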
     
    Last edited: Aug 25, 2007
  14. Aug 25, 2007 #13

    rcgldr

    User Avatar
    Homework Helper

    I didn't realize the compiler would inline CmprRecs; I was trying to avoid that by using a pointer to function, to make the comparison fair. You also have a point that qsort doesn't know at compile time that each "record" is a 4-byte pointer, so it has to do an extra level of indirection on every record access.

    However, in my previous runs years ago in 16-bit mode, I compared an assembler-based heap sort with the merge sort, and the merge sort was still quicker. I'll have to see if I can get permission to post the source for the heap sort.

    I updated sort.zip to use the user process time, but the resolution is still based on a 64 Hz ticker.
     
    Last edited: Aug 26, 2007
  15. Aug 26, 2007 #14

    Hurkyl

    User Avatar
    Staff Emeritus
    Science Advisor
    Gold Member

    It's possible you've managed to avoid the inlining, but I wouldn't be sure unless I actually did some testing (or inspected the assembly), e.g. call the function directly and see if that accelerates the merge sort.

    Incidentally, the C library qsort function isn't required to actually be a quick sort. While doing Google searches for my posts, I discovered that some implementations use a merge sort instead! (Or, maybe just for some cases)
     
  16. Aug 26, 2007 #15

    rcgldr

    User Avatar
    Homework Helper

    I added a counter to the CmprRecs function, to eliminate the issue of the generic overhead in the qsort. Msort0 has fewer compares than qsort. One reason for this is that when a pair of groups is merged, and the end of one group is reached, the remainder of the other group is copied without any compares.
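    For reference, a sketch of that kind of instrumentation, assuming a qsort-style comparator over pointers to 10-byte text records (the names and the whole-record memcmp are hypothetical):

    Code (Text):

    #include <cstring>

    static unsigned long g_compareCount = 0;    // bumped on every comparison

    // Hypothetical stand-in for CmprRecs: compares two records through their
    // pointers and counts how many times it is called.
    int CmprRecsCounted(const void *a, const void *b)
    {
        ++g_compareCount;
        const char *ra = *static_cast<const char *const *>(a);
        const char *rb = *static_cast<const char *const *>(b);
        return std::memcmp(ra, rb, 10);
    }

    The same comparator (and counter) can be handed to qsort, the merge sorts, and the heap sort, so the compare counts are directly comparable.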

    Regarding the Visual Studio 2005 implementation of qsort, I can't be sure of what it's doing, except I'm pretty sure that it's a sort in place (that it doesn't allocate memory to create a second list of element pointers required for a merge sort).

    I'm using GetProcessTimes(), the actual user time of the process; it's based on a 64 Hz (15.625 ms) ticker.
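    A minimal Windows-specific sketch of reading the user time that way; the conversion is to 100 ns ticks, the same 0.1-microsecond units used in result.txt below:

    Code (Text):

    #include <windows.h>

    // User-mode CPU time of the current process, in 100 ns (0.1 us) ticks.
    unsigned long long userTimeTicks()
    {
        FILETIME creationTime, exitTime, kernelTime, userTime;
        GetProcessTimes(GetCurrentProcess(),
                        &creationTime, &exitTime, &kernelTime, &userTime);
        ULARGE_INTEGER t;
        t.LowPart  = userTime.dwLowDateTime;
        t.HighPart = userTime.dwHighDateTime;
        return t.QuadPart;      // resolution is still the 15.625 ms scheduler tick
    }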

    msort0 uses fixed-size groups, oriented toward random data files.
    msort1 uses variable-size groups, oriented toward files with some sequential (ascending or descending) data.
    qsort is the Microsoft library qsort.
    hsort is a heap sort.

    Link to zip of source, executables, test files and result.txt
    sort.zip

    result.txt - ticks are multiples of 0.1 microseconds, e.g. 781250 = 78.1250 milliseconds.
    The last two rows show how the algorithms handle a file that is already sorted; msort1 is fastest there.

    Code (Text):

                 msort0   msort1    qsort    hsort

    # ticks      625000   781250   937500  1562500   262144 10 byte records (e)
    # compares  4386555  4514936  5583563  8627661

    # ticks     4218750  4375000  6093750 14843750  1048576 10 byte records (pi)
    # compares 19644759 20157907 24587372 38705534

    # ticks     1250000   156250  1718750  4843750  1048576 10 byte records (srt)
    # compares 10485760  1048575 20701938 39648064
     
     
    Last edited: Aug 26, 2007
  17. Aug 26, 2007 #16

    rcgldr

    User Avatar
    Homework Helper

    Updated previous post to include two more sorts and another file.
     
  18. Aug 26, 2007 #17

    Hurkyl

    User Avatar
    Staff Emeritus
    Science Advisor
    Gold Member

    It will take a bit to do the comparison properly, but I hacked something together quickly. I generated 2^20 random __int64's (MSVC's 64-bit integer type) with rand, then I do the following:

    Copy the data into a scratch buffer.
    Invoke a library routine to sort.

    Actually, I do this 100 times, to allow for more accurate timing. I used C++'s sort and stable_sort, and I used C's qsort. The results are:

    24308746 comparisons 0.16 seconds for sort
    24805310 comparisons 0.28 seconds for stable_sort
    20360941 comparisons 0.32 seconds for qsort


    The implementations are those provided in the MSVC 2005 Express download; sort is implemented as an introsort (quicksort, but switching to heapsort if the recursion gets bad), and stable_sort is implemented as a merge sort.

    My comparator was a function that simply invokes the operator<. (The times were 0.20, 0.28, 0.35 when I was counting the number of compares)
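    A rough sketch of this harness; the clock()-based timing and the placeholder data generation are assumptions, not necessarily what was actually used:

    Code (Text):

    #include <algorithm>
    #include <cstddef>
    #include <cstdio>
    #include <cstdlib>
    #include <ctime>
    #include <vector>

    typedef long long int64;                    // stands in for MSVC's __int64

    int main()
    {
        std::vector<int64> data(1 << 20), scratch(1 << 20);
        for (std::size_t i = 0; i < data.size(); ++i)
            data[i] = std::rand();              // placeholder; see the question below
                                                // about widening rand() to 64 bits

        std::clock_t start = std::clock();
        for (int pass = 0; pass < 100; ++pass) {
            scratch = data;                     // copy into the scratch buffer
            std::sort(scratch.begin(), scratch.end());   // library routine under test
        }
        double secs = double(std::clock() - start) / CLOCKS_PER_SEC / 100;
        std::printf("%.2f seconds per sort\n", secs);
        return 0;
    }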



    If the data is already sorted, my results are

    15933385 comparisons 0.06 seconds for sort
    09895936 comparisons 0.14 seconds for stable_sort
    15735219 comparisons 0.12 seconds for qsort
     
    Last edited: Aug 26, 2007
  19. Aug 26, 2007 #18

    rcgldr

    User Avatar
    Homework Helper

    update - I see you updated your results, so I'm updating this post.

    2^20 = 1048576 10 byte records (each record has 8 digits of pi, cr, lf)

    24587372 compares with VS2005 qsort vs 20360941 for your qsort / data.
    20157907 compares with merge sort 1.
    19644759 compares with merge sort 0.

    sorted file:

    20701938 compares with VS2005 qsort vs 15735219 for your qsort / data.
    10485760 compares with merge sort 0.
    1048575 compares with merge sort 1 (minimum).
     
    Last edited: Aug 26, 2007
  20. Aug 26, 2007 #19

    Hurkyl

    User Avatar
    Staff Emeritus
    Science Advisor
    Gold Member

    My cut-and-paste missed where I was resetting my counter. I'm updating my results; it's curious that MSVC's qsort uses fewer compares than its other two sorting routines! It makes sense, I suppose: because it's forced to call the comparator through a function pointer, it's worth doing some extra work to reduce the number of comparisons.
     
  21. Aug 26, 2007 #20

    rcgldr

    User Avatar
    Homework Helper

    I looked up rand(), and it's limited to a max value of 32767. How did you expand this to 64 bits?
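    For reference, one common way to build a wider value out of 15-bit rand() results; this is only an illustration, not necessarily what was done here:

    Code (Text):

    #include <cstdlib>

    // Concatenate several 15-bit rand() outputs into one 64-bit value.
    unsigned long long rand64()
    {
        unsigned long long r = 0;
        for (int i = 0; i < 5; ++i)             // 5 x 15 = 75 bits; the top bits overflow out
            r = (r << 15) | (std::rand() & 0x7FFF);
        return r;
    }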
     