Efficient Merge Sort for External Storage Optimization

  • Thread starter rcgldr
  • Tags
    Sorting
In summary: the thread compares sorting in RAM versus on disk; radix sort is faster than quicksort, but comparison-based sorts can be faster than radix sorts when the data is sparse.
  • #36
Jeff Reid said:
A pointer to function is required since the routine compares records given 2 pointers.
A function object would work:

Code:
#include <algorithm>
#include <vector>

struct comparator
{
   inline bool operator()(const unsigned char *a, const unsigned char *b) const
   {
      // compare somehow. We know they're 8 bytes each, so...
      return std::lexicographical_compare(a, a + 8, b, b + 8);
      // memcmp might be faster, though.
   }
};
// ...
std::vector<const unsigned char*> array;   // element type matches the comparator's arguments
// ...
std::sort(array.begin(), array.end(), comparator());
 
  • #37
The 8 byte or 2 __int64 compare would work for src.txt, but this is supposed to be a generic text file sort, where the record size is not fixed, or if fixed, not known at compile time. If I have time, I'll try modifying the quicksort, sort(), that doesn't use the pointer to function, and see if I can replace the compares with a call to a function that does memcmp like my code does.
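Roughly, the compare would look like this minimal sketch (hypothetical names, not the actual sortp code): memcmp the common prefix of two variable-length records, then break a tie on length, shorter record first.

Code:
#include <cstddef>
#include <cstring>

// hypothetical record view: pointer plus length (not the actual sortp layout)
struct record
{
   const unsigned char *data;
   size_t length;
};

// memcmp the common prefix; if the prefixes tie, the shorter record sorts first
inline int compare_records(const record &a, const record &b)
{
   size_t n = a.length < b.length ? a.length : b.length;
   int c = std::memcmp(a.data, b.data, n);
   if (c != 0)
      return c;
   return (a.length < b.length) ? -1 : (a.length > b.length) ? 1 : 0;
}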

Still, once compare time becomes more significant than swap or move time, which can happen when using 32-bit pointers to records, the quicksort is going to be slower, since it does so many more compares than a merge sort. Quicksort's main advantage is for sorting data, not pointers to large records, because it only swaps data when needed (for a random file it swaps about 1/2 of the time), versus the merge sort, which is always moving data, except for the initial pass to create groups. This changes when it's pointers and not data being swapped or moved around.
 
  • #38
I've just started poking at it. It's hard to think of how to optimize a quicksort without cheating; all of my best ideas either turn into a hybrid with radix sort, or get incredibly lucky with the 8-characters-per-line setup.

Speaking of radix sort, my first draft runs as fast as your merge sort for the pi file.


A slight variation shaves 10% off of the runtime when sorting pi -- in my code, when the buckets are small enough, I switch to std::sort. If that happens after n levels of radix sort, I can tell the comparator to start comparing at character #n of each string, rather than at character #0.

I can shave another 9% off by copying the buckets back into the original array before recursing.
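The shape of it is roughly this minimal sketch (not the attached simple_radix.cpp.txt itself; it assumes fixed 8-character records like the pi file, and SMALL_BUCKET is a made-up cutoff):

Code:
#include <algorithm>
#include <cstring>
#include <vector>

typedef std::vector<const unsigned char*>::iterator iter;

const size_t SMALL_BUCKET = 64;   // hypothetical cutoff; tune so a bucket fits in cache

// comparator that starts at character #depth, since the first depth
// characters are already equal within a bucket
struct tail_compare
{
   size_t depth;
   explicit tail_compare(size_t d) : depth(d) {}
   bool operator()(const unsigned char *a, const unsigned char *b) const
   {
      return std::memcmp(a + depth, b + depth, 8 - depth) < 0;
   }
};

void radix_sort(iter first, iter last, size_t depth)
{
   if (size_t(last - first) <= SMALL_BUCKET || depth == 8)
   {
      std::sort(first, last, tail_compare(depth));
      return;
   }
   std::vector<const unsigned char*> bucket[256];
   for (iter it = first; it != last; ++it)
      bucket[(*it)[depth]].push_back(*it);
   // copy the buckets back into the original array before recursing,
   // so each recursive call works on one contiguous, cache-friendly range
   iter out = first;
   for (int b = 0; b < 256; ++b)
   {
      iter begin = out;
      out = std::copy(bucket[b].begin(), bucket[b].end(), out);
      radix_sort(begin, out, depth + 1);
   }
}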
 

Attachments

  • simple_radix.cpp.txt
    2.1 KB · Views: 401
  • #39
In the case of the pi file, you only need 10 buckets and the distribution between buckets should be equal, so a radix sort should work well. As I mentioned before, the ratio of memory used versus size of data should help with a sort, as long as the overhead doesn't get too large.

My main point was that "in-place" sorts are usually slower than a merge sort, but I was surprised that Microsoft's quicksort, sort(), was faster, until I realized that it avoids movement of data, although it does more compares. Also, it doesn't preserve record order for "equal" records, but I'm not sure how much of a factor that is. However, it's much slower when using a function pointer for compares, though this is partially due to passing the address of the function in all the nested calls instead of using a global pointer.

You should try my sort program using the already-sorted pi file output from one run as the input to another run. In that case it only does the bare minimum 2^20-1 compares.

For merge sorts, there's no hard rule for the initial pass used to create the groups. In my version I simply looked for increasing or decreasing sequences (reversing pointers afterwards to create a group), of which there won't be a lot in pure random data, but which occur more often in "typical" data. This is commonly called a "natural" merge sort. In Microsoft's stable_sort(), an insertion sort creates groups of 32 elements, and I'm not sure if 32 is an optimal number for random data.
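As an illustration of that initial pass, here's a minimal sketch with made-up names (less stands in for the record compare); note that only strictly descending runs are reversed, which keeps the sort stable:

Code:
#include <algorithm>
#include <cstddef>
#include <vector>

// one pass over the pointer array, recording the exclusive end of each group;
// ascending runs are kept as-is, strictly descending runs are reversed in place
std::vector<size_t> find_runs(std::vector<const unsigned char*> &v,
                              bool (*less)(const unsigned char*, const unsigned char*))
{
   std::vector<size_t> run_end;
   size_t n = v.size();
   size_t i = 0;
   while (i < n)
   {
      size_t j = i + 1;
      if (j < n && less(v[j], v[i]))
      {
         // strictly descending: extend the run, then reverse the pointers
         while (j + 1 < n && less(v[j + 1], v[j]))
            ++j;
         ++j;
         std::reverse(v.begin() + i, v.begin() + j);
      }
      else
      {
         // ascending (or single-element) run: extend while in order
         while (j < n && !less(v[j], v[j - 1]))
            ++j;
      }
      run_end.push_back(j);
      i = j;
   }
   return run_end;
}

The run boundaries then feed the bottom-up merge passes.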

As a hybrid sort, a radix sort could be used on the initial pass, then the first record from each bucket used to create ordered groups of 10 (or less as buckets emptied) records, then MergeGroups could be used to finish off the sort. All of this is counting on the data being fairly uniform in distribution though, because the groups would be small if some of the buckets emptied quickly compared to others. Although this would be good for the pi or similar pure random data, it wouldn't be good with other patterns of data.

This has been an enjoyable thread for me, as it gave me incentive to re-write some sort code I haven't done in years. The last bit of interesting code I worked on was error correction code and compression software, since then, it's been the more mundane typical things that programmers actually get paid to do.
 
  • #40
The main reason I was thinking about the hybrid was that I was considering storing not just pointers in the array to be sorted, but also a segment of each string, e.g. I would sort a list of:

struct sort_key {   // name added so the struct can be referred to below
   const unsigned char *pointer_to_real_string;
   unsigned char sample_data[8];   // not const, so the structs stay assignable for sorting
};

So an L1 or L2-cache-sized array of these things would be much faster to sort, if most comparisons could be resolved by the sample data. But since I'm only sorting on 8 bytes at a time, to handle the general case, the code would have to turn into a radix-2^64 sort, with qsort used to separate the data into buckets. And I feel like I'm cheating when testing on a uniform random file of 8 byte records. :smile:
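The comparator that implies would be something like this hedged sketch, using the sort_key struct declared above (it assumes nul-terminated records at least 8 bytes long, with sample_data caching the first 8 bytes): resolve from the cache-resident sample, and only dereference the real string on a tie.

Code:
#include <cstring>

// sort_key as declared above
inline bool operator<(const sort_key &a, const sort_key &b)
{
   int c = std::memcmp(a.sample_data, b.sample_data, 8);
   if (c != 0)
      return c < 0;   // resolved without touching the real string (no cache miss)
   // samples tie: compare the remainder of the real strings
   return std::strcmp((const char*)a.pointer_to_real_string + 8,
                      (const char*)b.pointer_to_real_string + 8) < 0;
}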
 
  • #41
Link to another file to test the sorts with, which should be more fair, given the intent of a text sort.
It's the instructions and operands (a disassembly without addresses) of an old program. Note, records are not all the same length, but the stuff I wrote handles this just fine.

http://jeffareid.net/misc/src1.zip
 
  • #42
It's been a while since I've seen a post from Hurkyl. I'm looking for a large public domain document that I can convert to text as another test file.
 
  • #43
Bah, the U.S. constitution

http://www.constitution.org/cons/constitu.txt

isn't big enough. :frown: What about that sequence of usenet postings for GR? Lemme look for it...


Bah, the bible seems to be too short too

http://patriot.net/~bmcgin/kjv12.txt

but at least things all run in nonzero time. On this document, my implementation using STL's sort beats your merge sort which beats my radix sort.

I used std::sort to sort an array of std::pair<const char*, const char*> (the first entry of the pair is the beginning of a line, the second is the past-the-end pointer of the line) using a comparator that invokes std::lexicographical_compare.

(I think it's essentially the same as in my radix sort code, except that I call std::sort instead of my radix sort)
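In code it's roughly this (paraphrased, with made-up names rather than the exact ones in my source):

Code:
#include <algorithm>
#include <utility>
#include <vector>

// first = beginning of a line, second = past-the-end pointer of the line
typedef std::pair<const char*, const char*> line_t;

struct line_less
{
   bool operator()(const line_t &a, const line_t &b) const
   {
      return std::lexicographical_compare(a.first, a.second,
                                          b.first, b.second);
   }
};

// usage, given line ranges collected while scanning the file buffer:
// std::vector<line_t> lines;
// std::sort(lines.begin(), lines.end(), line_less());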


I really ought to implement a data structure like your string to see if I can improve it any more.
 
  • #44
Hurkyl said:
On this document, my implementation using STL's sort beats your merge sort which beats my radix sort.
Did you try the test source file from http://jeffareid.net/misc/src1.zip yet? (Note this is different from the src1.txt from before, which was made up of 8 digits of pi plus cr, lf; rename that one or the one in src1.zip.)
I really ought to implement a data structure like your string to see if I can improve it any more.
You could just use my source code (feel free to use it), mainly getdata (read file), which converts the records terminated by <cr> <lf> to <00> <cnt>, where <cnt> is the count for a p-string, and initpointers, which creates an array of pointers to all the p-strings.
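For reference, one plausible single-pass version of that conversion looks like this (a sketch, not the actual sortp source; it assumes getdata has already read the file into buf, and that records are short enough for the size to fit in one byte):

Code:
#include <cstddef>
#include <vector>

// "line\r\n" becomes "line\0<cnt>": compares can stop at the <00>, and
// <cnt> (total record size including the 2 overhead bytes) lets a scan
// jump straight to the next record
std::vector<unsigned char*> initpointers(unsigned char *buf, size_t size)
{
   std::vector<unsigned char*> ptrs;
   size_t start = 0;
   for (size_t i = 0; i + 1 < size; ++i)
   {
      if (buf[i] == '\r' && buf[i + 1] == '\n')
      {
         buf[i]     = 0;                                // <00> terminator
         buf[i + 1] = (unsigned char)(i + 2 - start);   // <cnt> = record size
         ptrs.push_back(buf + start);
         start = i + 2;
         ++i;                                           // step past the lf slot
      }
   }
   return ptrs;
}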
 
  • #45
I updated sortp.zip to include two test files. src1.txt is the pi file, 8 digits of pi followed by cr, lf for each record, 2^20 records. src2.txt is assembly opcodes and operands from a disassembly, terminated with cr, lf, and also 2^20 records. src2.txt will take a bit longer, since there's more similar data in the file, causing longer compares.

Regarding the radix sort, this could be done with 10 buckets and the pi (src1.txt) file. Separate the data into 10 buckets according to the 8th digit in each record. Merge the buckets (or use a second set of 10 buckets), and repeat for the 7th digit, then the 6th, until the 1st digit is done. Basically the same algorithm used by card sorters. For binary data, 256 buckets and bytes could be used. The number of passes is the number of sub-elements in a record; for example, 8 passes for an 8-byte record processed with 256 buckets.
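A minimal sketch of that card-sorter approach for the pi file (assuming each record starts with exactly 8 ASCII digits):

Code:
#include <cstddef>
#include <vector>

// LSD radix sort: 10 buckets, one pass per digit, 8th digit first;
// appending in encounter order keeps each pass stable, which is what
// the method relies on
void lsd_radix_sort(std::vector<const unsigned char*> &v)
{
   for (int digit = 7; digit >= 0; --digit)   // 8th digit down to the 1st
   {
      std::vector<const unsigned char*> bucket[10];
      for (size_t i = 0; i < v.size(); ++i)
         bucket[v[i][digit] - '0'].push_back(v[i]);
      // merge the buckets back, in order, before the next pass
      size_t k = 0;
      for (int b = 0; b < 10; ++b)
         for (size_t i = 0; i < bucket[b].size(); ++i)
            v[k++] = bucket[b][i];
   }
}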
 
  • #46
A bottom-up radix sort would probably work well on the pi file, but I can't imagine it would do well for general text sorting, since it cannot take advantage of the possibility that the first fraction of the characters are enough to mostly sort the data.
 
  • #47
Hurkyl said:
A bottom-up radix sort would probably work well on the pi file, but I can't imagine it would do well for general text sorting.
Agreed, it's still my opinion that the "natural" merge sort would be best for general text sorting, especially if there's any significant pre-ordering of data. Still the quicksort is fast, and in the case of just sorting numbers (as from the other thread) the fact that 4 million 64 bit numbers can be sorted in less than a second with either algorithm makes the choice less important.

I'd like to see the merge sort article at Wiki cleaned up to reflect the fact that most merge sorts are "bottom up", using a loop (plus the initial create-group pass), instead of "top down" with recursion, and to see a replacement for that awful picture that goes along with the top-down method.
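For reference, the bottom-up shape is just a loop that doubles the group width each pass; a minimal sketch (a natural merge sort would seed it with the runs from the initial pass instead of width-1 groups):

Code:
#include <algorithm>
#include <cstddef>
#include <vector>

// non-recursive merge sort: merge pairs of adjacent groups between two
// buffers, doubling the group width each pass
template <typename T>
void bottom_up_merge_sort(std::vector<T> &a)
{
   std::vector<T> b(a.size());
   for (size_t width = 1; width < a.size(); width *= 2)
   {
      for (size_t lo = 0; lo < a.size(); lo += 2 * width)
      {
         size_t mid = std::min(lo + width, a.size());
         size_t hi  = std::min(lo + 2 * width, a.size());
         std::merge(a.begin() + lo, a.begin() + mid,
                    a.begin() + mid, a.begin() + hi,
                    b.begin() + lo);
      }
      a.swap(b);
   }
}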

As mentioned before, it's been interesting working on sort programs again, plus I learned a bit about the Microsoft Standard Template Library. I thought it was clever that this "library" really isn't a library, but code that gets compiled with the user's code. Is the syntax used for the standard template library part of C++, or is it an extension?
 
  • #48
The STL is, in fact, part of the C++ standard! It's fairly useful as is, but I get the impression it's in the process of being vastly extended (see Technical Report 1; yay for Boost).

There are a few STL resources on the internet; sgi's page on the STL is, IMHO, really good. It's outdated, but it's mostly right.

Templates are a really useful feature; as you've seen with the STL examples, they reduce "abstraction overhead". While a template library like Blitz++ (http://www.oonumerics.org/blitz/benchmarks/) might be slightly slower for linear algebra with large arrays of doubles, the C++ library is immediately applicable to any other numeric type you might think to do linear algebra with (e.g. finite field elements, 128-bit integer types, etc.). And template metaprogramming lets you make some optimizations that border on being ridiculous, and that would be completely infeasible in any other language, except by writing a program to write your program for you!

Sorry, I didn't mean to go off on a sales pitch, but I just think this stuff is really cool. :biggrin:
 