Demo of Sedgewick Bentley 3 way radix quicksort for very BIG files.

hutch-- · January 02, 2015, 07:44:21 PM

I have posted this mainly for Gunther but its the same sort algo posted in the PB subforum. It is designed specifically for sorting CRLF delimited text files in ascending order and is to date the fastest string sort I have seen. I found the original design at the Dr Dobbs site, coded it in MASM, bundled the functions together to simplify using the algo.

The test file I have been using is 1474697874 (1.47 gig) and the typical results for the sort is as follows.

K:\asm3\sbsort>sorttest > sorted.txt
2714 ms sort time for file of 1474697874 bytes with record count = 10000001

It does not directly do the disk write as this is a test piece, to test it you redirect the output the standard out, if you run it directly on a big file you may grow old watching the contents scroll past. To try it out you will have to supply your own BIG file and rebuild it with the name changed in the file load. Files of line counts of a million or so may not run long enough to time accurately.

jj2007 · January 02, 2015, 10:40:39 PM

Hutch,

The timings are fantastic, it's a factor 3 faster than MasmBasic's QSort :t

What I don't yet understand is
a) why it comes to a different line count than Recall():
47 ms sort time for file of 10664371 bytes with record count = 268423 (Recall: 309747)
b) the sort criteria:

Code Select

ñññññññññññññññññññññññññññññññññññññññññññññññññññññññññññññññññ *
ñññññññññññññññññññññññññññññññññññññññññññññññññññññññññññññññññ *
½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½ *
½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½ *
½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½ *
½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½ *
½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½ *
½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½½ *
"MS Sans Serif",8, \        ; font name & point size
"MS Sans Serif",8, \        ; font name & point size
"MS Sans Serif",8, \        ; font name & point size
"MS Sans Serif",8, \        ; font name & point size
"h" and optionally a leading "0" if required. Total string length
"h" and optionally a leading "0" if required. Total string length
"h" and optionally a leading "0" if required. Total string length
"h" and optionally a leading "0" if required. Total string length
"one shot" or "one pass" pads
"one shot" or "one pass" pads
"testreg" receives the address of the zero terminated string in
"testreg" receives the address of the zero terminated string in

It seems the sort starts in the high ASCII ranges (164, 171), then restarts lower(quote=34, %=37, ...). How do you treat empty lines? Leading spaces? Is the sort case-sensitive?

For comparison, the first lines of the MasmBasic version (with some fancy settings, see QSortMode and attached source):

Code Select


162 ms sort time for file of 10664371 bytes with record count = 309747
% echo @CatStr(<Fix: >, @SubStr(%@FileCur, spos+1,), <(%@Line) - txt>)
% forc chr, @FileCur
% forc chr, @FileCur            ;; Don't display full path. Easier to read.
----------------- *
--------------------
------------------------ *
-------------------------

Testfile was generated with attached batch file.

Gunther · January 03, 2015, 04:59:52 AM

The algorithm has an excellent run time behavior. It's orders of magnitudes faster than qsort() implemented with ANSI C. I've tormented a bit the search engine: Sedgewick was a scholar by Knuth. That speaks for quality.

Gunther

hutch-- · January 03, 2015, 06:43:44 AM

I have tidied it up a bit, put the algo into a proper procedure with a prototype and changed the standard out to a block write of 16 at a time that certainly improved the disk output with redirection.

About 6 - 7 years ago I played with sorts and did many different types and while most of them responded well to manual optimisation, the 3 way radix quick sort was so well designed that it found its speed limit to the boundaries of memory speed and nothing would make it faster. I posted the Dr Dobbs link in the PB forum to get the original design.

RE the behaviour with blank lines, leading spaces etc, it will sort whatever text based data you point at it but the tokeniser I have used trims leading blanks and bypasses empty lines, you would basically make sure you tidy up what you point at the sort algo first.

jj2007 · January 03, 2015, 07:09:11 AM

Quote from: hutch-- on January 03, 2015, 06:43:44 AMRE the behaviour with blank lines, leading spaces etc, it will sort whatever text based data you point at it but the tokeniser I have used trims leading blanks and bypasses empty lines, you would basically make sure you tidy up what you point at the sort algo first.

And strings starting with ASCII 164 or 171 are treated as negative (see above)?

hutch-- · January 03, 2015, 10:47:07 AM

It is actually specified as a string sort so characters 164 and 171 may not work in the correct order.

The MASM Forum

News:

Demo of Sedgewick Bentley 3 way radix quicksort for very BIG files.

hutch--

jj2007

Gunther

hutch--

jj2007

hutch--