Efficient Merge Sort for External Storage Optimization

  • Thread starter: rcgldr
  • Tags: Sorting

Summary
Merge sort is highlighted as an efficient sorting algorithm for external storage, particularly due to its ability to handle multiple I/O operations simultaneously. It typically requires more storage space compared to other algorithms, such as quicksort, which can be more efficient for smaller datasets. Radix sort is mentioned as potentially faster than comparison-based sorts, particularly for uniformly distributed data, but may not perform well with sparse datasets. The discussion emphasizes that while merge sort is effective for large data volumes, the choice of sorting algorithm can significantly impact overall runtime depending on the data characteristics and system architecture. Ultimately, the context of the dataset and the specific requirements of the sorting task are crucial in determining the best sorting method.
  • #31
Hehe.

I still care, but I've had lots to do since I got back home, so I haven't had time to really work on this.

It doesn't help that my desktop doesn't like Visual C++ Express. :frown:
 
  • #32
What's the issue you're having?

I created 2 more programs that should compile with Studio Express (there's no windows.h in Express, so I avoided any functions that depend on it). The source files will build in 32-bit or 64-bit mode with no changes (or even conditionals).

http://jeffareid.net/misc/sortcpp.zip

As mentioned before, the Microsoft routines are slow when using pointers to functions, and I'm not sure why. The sort routines are not libraries, but actual code in <algorithm> that gets compiled along with the source code that includes it, so things like the size of elements are optimized at compile time. Since it's just included source code, the pointer-to-function versions (which are actually separate pieces of source code) should benefit from any compiler optimization that user-generated code does.

The overhead of the pointer-to-function routines would mean that the Microsoft sorts would be a poor choice if what is being sorted is a list of pointers to records, which would be the normal method if record size were significant. So it appears that the Microsoft sort routines are only good for sorting arrays of reasonably small elements.
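For illustration, the two call styles being compared amount to roughly this (a minimal sketch; the element type and names are assumptions, not code from sortcpp.zip):

Code:
#include <algorithm>
#include <vector>

typedef long long int64;   // stand-in for __int64

// pointer-to-function compare: each compare is an indirect call
bool cmp_func(const int64 &a, const int64 &b) { return a < b; }

// function object compare: the compare is normally inlined at compile time
struct cmp_obj
{
   bool operator()(const int64 &a, const int64 &b) const { return a < b; }
};

void sort_both_ways(std::vector<int64> &v, std::vector<int64> &w)
{
   std::sort(v.begin(), v.end(), cmp_func);    // slower: indirect call per compare
   std::sort(w.begin(), w.end(), cmp_obj());   // faster: compare inlined at the call site
}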
 
Last edited by a moderator:
  • #33
What optimizers do and don't do can be pretty weird. Last time I was looking at the speed of sort routines, I was experimenting with comparators, and I discovered that gcc wouldn't inline a nearly trivial function object unless I explicitly declared it inline. :frown: This was frustrating, because the STL class less<T> and the like did not declare their operator() method as inline, and thus performed poorly. :frown:
 
  • #34
Updated the code to use fopen and fwrite and to have the CMPFUNC option again; it works with Studio Express (will have to test 64-bit mode later). Update: confirmed the same source works just fine in 64-bit mode.

sortv.zip

In 32-bit mode, sort() and msort() take about the same time (on 4 million __int64's). In 64-bit mode sort() is about 14% faster. If a pointer to function is used for the compare, the Microsoft routines slow down quite a bit; for one thing the pointer to function is passed as a parameter, so nested calls add to the overhead, and there are about 60% more compares with sort() versus msort(). If data size is reasonably large and pointers to the data are sorted instead of the data (in memory; this is a terrible way to do this on a disk), then the penalty of the pointer to function makes both Microsoft routines much slower.

As mentioned before, if the data isn't truly random and has a significant amount of pre-ordered data, then msort1() from http://jeffareid.net/misc/sort.zip will be the fastest sort.
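For anyone who wants to reproduce this kind of comparison with just the standard routines, here is a rough sketch (the element count, random fill, and clock() timing are assumptions, not the actual test program):

Code:
#include <algorithm>
#include <cstdio>
#include <cstdlib>
#include <ctime>
#include <vector>

typedef long long int64;   // stand-in for __int64

int main()
{
   const size_t N = 4 * 1024 * 1024;              // ~4 million elements
   std::vector<int64> a(N), b;
   for (size_t i = 0; i < N; ++i)                 // pseudo-random test data
      a[i] = ((int64)rand() << 32) ^ rand();
   b = a;                                         // same data for both sorts

   clock_t t0 = clock();
   std::sort(a.begin(), a.end());                 // the quicksort-based sort()
   clock_t t1 = clock();
   std::stable_sort(b.begin(), b.end());          // the merge-sort-based stable_sort()
   clock_t t2 = clock();

   printf("sort():        %ld ticks\n", (long)(t1 - t0));
   printf("stable_sort(): %ld ticks\n", (long)(t2 - t1));
   return 0;
}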
 
Last edited by a moderator:
  • #35
I redid the text file sort routines, so that the Microsoft vector routines could be used. The sequence is to read in a text file, convert records terminated by CR LF into p-strings, generate a set of pointers to each record, sort via the pointers, then output a file using the sorted pointers.
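A simplified sketch of that sequence, assuming the whole file is already in memory and records are under 256 bytes (illustrative only, not the actual getdata/initpointers code):

Code:
#include <cstddef>
#include <vector>

// Convert CR LF terminated records in buf to p-strings (<00><cnt> replaces
// <cr><lf>) and collect a pointer to the start of each record.
std::vector<unsigned char *> init_pointers(unsigned char *buf, size_t size)
{
   std::vector<unsigned char *> ptrs;
   size_t start = 0;
   for (size_t i = 0; i + 1 < size; ++i)
   {
      if (buf[i] == '\r' && buf[i + 1] == '\n')
      {
         buf[i]     = 0;                             // terminator
         buf[i + 1] = (unsigned char)(i - start);    // record length (assumes < 256)
         ptrs.push_back(buf + start);
         start = i + 2;
         ++i;
      }
   }
   return ptrs;   // sort these pointers, then write the records out in sorted order
}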

In this case, Microsoft's quicksort routine, sort(), was 50% slower than their merge sort routine, stable_sort(), which in turn was about 15% slower than my merge sort. A pointer to function is required since the routine compares records given 2 pointers.

Included in this zip are sortp.cpp (Microsoft sort or stable_sort), msort.cpp (my sort), and a test source file, src.txt (1,000,000 records, 8 bytes of pi digits, CR, LF). This time I enabled the variable group size define, which speeds up the merge sort if there's any significant ordering of data. With the random test file, the time is about the same as with the variable group size define disabled, about 2% more comparisons, but with some groups created larger than 2, and none smaller than 2 (except for the last record of a file with an odd number of records).

sortp.zip

The challenge now is to see if someone can make a quicksort based algorithm run faster than my merge sort for this case of a text file sort via pointers to records.
 
Last edited:
  • #36
Jeff Reid said:
A pointer to function is required since the routine compares records given 2 pointers.
A function object would work:

Code:
#include <algorithm>
#include <vector>

struct comparator
{
   inline bool operator()(const unsigned char *a, const unsigned char *b) const
   {
      // compare somehow. We know they're 8 bytes each, so...
      return std::lexicographical_compare(a, a + 8, b, b + 8);
      // memcmp might be faster, though.
   }
};
// ...
std::vector<const unsigned char*> array;
// ...
std::sort(array.begin(), array.end(), comparator());
 
Last edited:
  • #37
The 8 byte or 2 __int64 compare would work for src.txt, but this is supposed to be a generic text file sort, where record size is not fixed, or if fixed, is not known at compile time. If I have time, I'll try modifying the quicksort, sort(), version that doesn't use the pointer to function, and see if I can replace the compares with a call to a function that does a memcmp like my code does.

Still, once compare time becomes more significant than swap or move time, which can happen when using 32-bit pointers to records, the quicksort is going to be slower since it does so many more compares than a merge sort. Quicksort's main advantage is for sorting data, not pointers to large records, because it only swaps data when needed, versus the merge sort, which is always moving data except for the initial pass to create groups (for a random file it swaps about 1/2 of the time). This changes when it's pointers and not data that's being swapped or moved around.
 
Last edited:
  • #38
I've just started poking at it. It's hard to think of how to optimize a quicksort without cheating; all of my best ideas either turn into a hybrid with radix sort, or get incredibly lucky with the 8 characters-per-line setup.

Speaking of radix sort, my first draft runs as fast as your merge sort for the pi file.


A slight variation shaves 10% off of the runtime when sorting pi -- in my code, when the buckets are small enough, I switch to std::sort. If that happens after n levels of radix sort, I can tell the comparator to start comparing at character #n of each string, rather than at character #0.

I can shave another 9% off by copying the buckets back into the original array before recursing.
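Roughly, the structure is this (an illustrative MSD radix sort on NUL-terminated record pointers; the bucket cutoff, names, and fallback comparator are assumptions rather than my exact code):

Code:
#include <algorithm>
#include <cstring>
#include <vector>

struct tail_compare                    // compare starting at character 'depth'
{
   size_t depth;
   explicit tail_compare(size_t d) : depth(d) {}
   bool operator()(const unsigned char *a, const unsigned char *b) const
   {
      return std::strcmp((const char *)a + depth, (const char *)b + depth) < 0;
   }
};

void radix_sort(std::vector<const unsigned char *> &v, size_t depth = 0)
{
   if (v.size() <= 32)                 // small bucket: fall back to std::sort,
   {                                   // comparing from character 'depth' onward
      std::sort(v.begin(), v.end(), tail_compare(depth));
      return;
   }
   std::vector<const unsigned char *> bucket[256];
   for (size_t i = 0; i < v.size(); ++i)
      bucket[v[i][depth]].push_back(v[i]);          // distribute by current byte

   size_t out = 0;
   for (int b = 0; b < 256; ++b)
   {
      if (b != 0)                      // bucket 0 holds strings that end here; already equal
         radix_sort(bucket[b], depth + 1);
      for (size_t i = 0; i < bucket[b].size(); ++i)
         v[out++] = bucket[b][i];      // concatenate the bucket back into v
   }
}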
 


Last edited:
  • #39
In the case of the pi file, you only need 10 buckets and the distribution between buckets should be equal, so a radix sort should work well. As I mentioned before, the ratio of memory used versus size of data should help with a sort, as long as the overhead doesn't get too large.

My main point was that "in-place" sorts are usually slower than a merge sort, but I was surprised that Microsoft's quicksort, sort(), was faster, until I realized that it avoids movement of data, although it does more compares. Also, it doesn't preserve record order for "equal" records, but I'm not sure how much of a factor that is. However, it's much slower when using a function pointer for compares, and this is partially due to passing the address of the function through all the nested calls instead of using a global pointer.

You should try my sort program using the already sorted pi file output from one run as the input to another run. In that case it only does the bare minimum of 2^20 - 1 compares.

For merge sorts, there's no hard rule for the initial pass used to create the groups. In my version I simply looked for increasing or decreasing sequences (reversing pointers afterwards to create a group), of which there won't be a lot in purely random data, but which occur more often in "typical" data. This is commonly called a "natural" merge sort. In Microsoft's stable_sort(), they use an insertion sort to create groups of 32 elements, and I'm not sure if 32 is an optimal number for random data.
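For reference, that initial pass amounts to something like this (a simplified sketch over an array of record pointers with a caller-supplied compare; it is not the actual msort code):

Code:
#include <algorithm>
#include <cstddef>
#include <vector>

// Returns the start index of each group, plus a final entry one past the end.
template <class Compare>
std::vector<size_t> find_runs(std::vector<const unsigned char *> &p, Compare less)
{
   std::vector<size_t> runs;
   size_t i = 0;
   while (i < p.size())
   {
      runs.push_back(i);
      size_t j = i + 1;
      if (j == p.size())
         break;
      if (less(p[j], p[i]))            // strictly decreasing sequence
      {
         while (j + 1 < p.size() && less(p[j + 1], p[j]))
            ++j;
         std::reverse(p.begin() + i, p.begin() + j + 1);   // reverse to make it a group
      }
      else                             // non-decreasing sequence (keeps the sort stable)
      {
         while (j + 1 < p.size() && !less(p[j + 1], p[j]))
            ++j;
      }
      i = j + 1;
   }
   runs.push_back(p.size());
   return runs;
}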

As a hybrid sort, a radix sort could be used on the initial pass, then the first record from each bucket used to create ordered groups of 10 (or less as buckets emptied) records, then MergeGroups could be used to finish off the sort. All of this is counting on the data being fairly uniform in distribution though, because the groups would be small if some of the buckets emptied quickly compared to others. Although this would be good for the pi or similar pure random data, it wouldn't be good with other patterns of data.

This has been an enjoyable thread for me, as it gave me incentive to re-write some sort code I haven't done in years. The last bit of interesting code I worked on was error correction code and compression software; since then, it's been the more mundane, typical things that programmers actually get paid to do.
 
Last edited:
  • #40
The main reason I was thinking about the hybrid was that I was considering storing not just pointers in the array to be sorted, but also a segment of the strings. For example, I would sort a list of:

struct {
    const unsigned char *pointer_to_real_string;
    unsigned char sample_data[8];   // first 8 bytes of the string, kept non-const so the struct stays assignable for sorting
};

So an L1 or L2-cache-sized array of these things would be much faster to sort, if most comparisons could be resolved by the sample data. But since I'm only sorting on 8 bytes at a time, to handle the general case, the code would have to turn into a radix-2^64 sort, with qsort used to separate the data into buckets. And I feel like I'm cheating when testing on a uniform random file of 8 byte records. :smile:
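A minimal sketch of that idea (the struct above with a name added; the tie-break on the full record is an assumption about how the general case would be handled):

Code:
#include <cstring>

struct sort_key
{
   const unsigned char *pointer_to_real_string;   // the real (NUL terminated) record
   unsigned char sample_data[8];                  // first 8 bytes, zero padded if shorter
};

inline bool operator<(const sort_key &a, const sort_key &b)
{
   int c = std::memcmp(a.sample_data, b.sample_data, 8);   // usually decides the order
   if (c != 0)
      return c < 0;
   if (std::memchr(a.sample_data, 0, 8))          // both records end inside the sample,
      return false;                               // so they are equal
   return std::strcmp((const char *)a.pointer_to_real_string + 8,
                      (const char *)b.pointer_to_real_string + 8) < 0;
}
// e.g. std::sort(keys.begin(), keys.end()); then write records out via the pointers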
 
  • #41
Link to another file to test the sorts with, which should be more fair, given the intent of a text sort.
It's the instructions and operands (disassembly without addresses) of an old program. Note, records are not all the same length, but the stuff I wrote handles this just fine.

http://jeffareid.net/misc/src1.zip
 
Last edited by a moderator:
  • #42
It's been a while since I've seen a post from Hurkyl. I'm looking for a large public domain document that I can convert to text as another test file.
 
  • #43
Bah, the U.S. Constitution

http://www.constitution.org/cons/constitu.txt

isn't big enough. :frown: What about that sequence of usenet postings for GR? Lemme look for it...


Bah, the Bible seems to be too short too

http://patriot.net/~bmcgin/kjv12.txt

but at least things all run in nonzero time. On this document, my implementation using STL's sort beats your merge sort, which beats my radix sort.

I used std::sort to sort an array of std::pair<const char*, const char*> (the first entry of the pair is the beginning of a line, the second entry is the past-the-end pointer of the line) using a comparator that invokes std::lexicographical_compare.

(I think it's essentially the same as in my radix sort code, except that I call std::sort instead of my radix sort)
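In code, that amounts to roughly the following (not the exact program):

Code:
#include <algorithm>
#include <utility>
#include <vector>

typedef std::pair<const char *, const char *> line;   // [begin, past-the-end) of one line

struct line_less
{
   bool operator()(const line &a, const line &b) const
   {
      return std::lexicographical_compare(a.first, a.second, b.first, b.second);
   }
};
// ...
// std::vector<line> lines;  (filled while scanning the document)
// std::sort(lines.begin(), lines.end(), line_less());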


I really ought to implement a data structure like your string to see if I can improve it any more.
 
Last edited by a moderator:
  • #44
Hurkyl said:
On this document, my implementation using STL's sort beats your merge sort which beats my radix sort.
Did you try the test source file from http://jeffareid.net/misc/src1.zip yet? (Note this is different from the src1.txt from before, which was made up of 8 digits of pi, cr, lf; rename that one or the one in src1.zip.)
Hurkyl said:
I really ought to implement a data structure like your string to see if I can improve it any more.
You could just use my source code (feel free to use it), mainly getdata (read file), which converts the records terminated by <cr> <lf> to <00> <cnt>, where <cnt> is the count for a p-string, and initpointers, which creates an array of pointers to all the p-strings.
 
Last edited by a moderator:
  • #45
I updated sortp.zip to include two test files. src1.txt is the pi file, 8 digits of pi followed by cr, lf for each record, 2^20 records. src2.txt is assembly opcodes and operands from a disassembly, terminated with cr, lf, also 2^20 records. src2.txt will take a bit longer, since there's more similar data in the file, causing longer compares.

Regarding the radix sort, this could be done with 10 buckets and the pi (src1.txt) file. Separate the data into 10 buckets according to the 8th digit in each record. Merge the buckets (or use a second set of 10 buckets), and repeat for the 7th digit, 6th digit, and so on until the 1st digit is done. Basically the same algorithm used by card sorters. For binary data, 256 buckets and bytes could be used. The number of passes is the number of sub-elements in a record, for example, 8 passes for an 8 byte record processed with 256 buckets.
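In outline, the card-sorter (least significant digit first) approach for the fixed-width pi file looks like this (a simplified sketch, not code from sortp.zip):

Code:
#include <cstddef>
#include <vector>

// 8 digit characters per record, 10 buckets, one pass per digit
// from the 8th digit down to the 1st.
void lsd_radix_sort(std::vector<const unsigned char *> &v)
{
   std::vector<const unsigned char *> bucket[10];
   for (int digit = 7; digit >= 0; --digit)
   {
      for (size_t i = 0; i < v.size(); ++i)
         bucket[v[i][digit] - '0'].push_back(v[i]);   // distribute, preserving order
      size_t out = 0;
      for (int b = 0; b < 10; ++b)                    // collect the buckets in order
      {
         for (size_t i = 0; i < bucket[b].size(); ++i)
            v[out++] = bucket[b][i];
         bucket[b].clear();
      }
   }
}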
 
Last edited:
  • #46
A bottom-up radix sort would probably work well on the pi file, but I can't imagine it would do well for general text sorting, since it cannot take advantage of the possibility that the first few characters are enough to mostly sort the data.
 
  • #47
Hurkyl said:
A bottom-up radix sort would probably work well on the pi file, but I can't imagine it would do well for general text sorting.
Agreed, it's still my opinion that the "natural" merge sort would be best for general text sorting, especially if there's any significant pre-ordering of data. Still, the quicksort is fast, and in the case of just sorting numbers (as in the other thread), the fact that 4 million 64-bit numbers can be sorted in less than a second with either algorithm makes the choice less important.

I'd like to see the merge sort article at Wiki cleaned up to reflect the fact that most merge sorts are "bottom up", using a loop (plus the initial create-group pass), instead of "top down" with recursion, and to lose that awful picture that goes along with the top-down method.
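For reference, the bottom-up loop amounts to something like this (a bare-bones sketch that starts with groups of 1 instead of the "natural" initial pass; not the msort code):

Code:
#include <algorithm>
#include <cstddef>
#include <vector>

template <class T>
void bottom_up_merge_sort(std::vector<T> &a)
{
   std::vector<T> tmp(a.size());
   for (size_t width = 1; width < a.size(); width *= 2)      // group size doubles each pass
   {
      for (size_t lo = 0; lo < a.size(); lo += 2 * width)
      {
         size_t mid = std::min(lo + width, a.size());
         size_t hi  = std::min(lo + 2 * width, a.size());
         std::merge(a.begin() + lo, a.begin() + mid,
                    a.begin() + mid, a.begin() + hi,
                    tmp.begin() + lo);                        // merge two adjacent groups
      }
      a.swap(tmp);                                            // merged output becomes next input
   }
}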

As mentioned before, it's been interesting working on sort programs again, plus I learned a bit about the Microsoft Standard Template Library. I thought it was clever that this "library" really isn't a library, but code that gets compiled with the user's code. Is the syntax used for the Standard Template Library part of C++ or is it an extension?
 
  • #48
The STL is, in fact, part of the C++ standard! It's fairly useful as is, but I get the impression it's in the process of being vastly extended (see Technical Report 1; yay for Boost).

There are a few STL resources on the internet; SGI's page on the STL is, IMHO, really good. It's outdated, but it's mostly right.

Templates are a really useful feature; as you've seen with the STL examples, they reduce "abstraction overhead" -- while Blitz++ (http://www.oonumerics.org/blitz/benchmarks/) might be slightly slower for linear algebra with large arrays of doubles... the C++ library is immediately applicable to any other numeric type you might think to do linear algebra with! (e.g. finite field elements, 128-bit integer types, etc.) And template metaprogramming lets you make some optimizations that border on being ridiculous -- and would be completely infeasible in any other language, except by writing a program to write your program for you!

Sorry, I didn't mean to go off on a sales pitch, but I just think this stuff is really cool. :biggrin:
 
Last edited by a moderator:
