For use with all kinds of sorting algos - executable and source attached.
include \masm32\MasmBasic\MasmBasic.inc ; download (http://masm32.com/board/index.php?topic=94.0)
Init
mov ecx, Val(Input$("Rand(): #elements to create: ", "1000000"))
.if signed ecx>0
shl ecx, 2 ; get DWORDs
NanoTimer()
Let edi=New$(ecx)
push edi ; we need it later
push ecx
.Repeat
void Rand(-1)
stosd
sub ecx, DWORD
.Until Sign?
pop ecx
pop edi
Let esi=Str$("Creating %i random DWORDs took ", ecx/DWORD)+Str$("%i ms\n", NanoTimer(ms))
Open "O", #1, "RandomDwords.dat"
PrintBuffer #1, edi, ecx
Close #1
Inkey esi, CrLf$, 'To read back the data into a buffer,', CrLf$, 'MasmBasic fans may use', CrLf$, 'Let esi=FileRead$("RandomDwords.dat")', CrLf$, 'otherwise check Masm32 InputFile'
.endif
Exit
end start
Thank you. Here is the result
Looks OK:
0 3080
1 9158
2 13464
3 15479
4 17700
5 22867
6 26355
7 29191
8 32229
9 42286
999991 4294921291
999992 4294921704
999993 4294929912
999994 4294934235
999995 4294945952
999996 4294946053
999997 4294951189
999998 4294958802
999999 4294960543
include \masm32\MasmBasic\MasmBasic.inc ; download (http://masm32.com/board/index.php?topic=94.0)
Init
Dim MyDw() As DWORD
ArrayRead (http://www.webalice.it/jj2006/MasmBasicQuickReference.htm#Mb1179) MyDw(), "SortedDwords.dat"
push eax ; #elements
xor ecx, ecx
.Repeat
mov eax, stack
sub eax, 10
.if ecx<10 || ecx>eax
PrintLine Str$(ecx), Str$("\t%u", MyDw(ecx))
.endif
inc ecx
.Until ecx>=stack
pop eax
Inkey "ok?"
Exit
end start
And the timings?
Tks, i´m uploading now the one from nrQsortA to compare. As i explained the resultatn data does not seems to be accurated. In fact, it is unsorted when feeding it with 1 million of random values. This explains why it took only a couple of secs to sort.
It seems like nrQsortA needs a review, because the sort it is being breaked when feeded with 1 million random values. Also, if i try to feed it with the already sorted data, nrQsortA ruins the result and produce unsorted values. There is a problem in the middle of the nrQsortA function causing it to stop and producing a wrong result.
Masm Result when feeded with RandomDwords.dat (Also, the same result is retrieved when you feede it with SortedDwords.dat):
Quote0 2147484709
1 2147490471
2 2147491291
3 2147491787
4 2147501199
5 2147513906
6 2147518668
7 2147519738
8 2147520028
9 2147521313
999991 2147439241
999992 2147439798
999993 2147440421
999994 2147450671
999995 2147450920
999996 2147454024
999997 2147454425
999998 2147476822
999999 2147481168
ok?
Guga result when feeded with RandomDwords.dat (This is the same correct result when feeded with SortedDwords.dat)
Quote0 3080
1 9158
2 13464
3 15479
4 17700
5 22867
6 26355
7 29191
8 32229
9 42286
999991 4294921291
999992 4294921704
999993 4294929912
999994 4294934235
999995 4294945952
999996 4294946053
999997 4294951189
999998 4294958802
999999 4294960543
ok?
I also gave a test on nrQsortA feeding it with an already sorted data (SortedDwords.dat), and even if it took only 16 milisecs to the function finishes, it completelly ruined the result, and changed. I didn´t analyzed nrQsortA to see why it is being breaked (perhaps, i´m porting it incorrecty to RosAsm)
On the other hand, on my algo, even if i pass a already sorted file it takes almost nothing to sort, n fact, if i feed it with SortedDwords.dat it took also only 16 milisecs to the algo finishes and the result is the same as in the loaded file.
Btw....did you tried to feed your algo (ArraySort) on this file (RandomDwords.dat) to test how much time it takes to parse ? Do you have the timmings of it ?
P.S: About the timmings, i´ve got this (i7 CPU 870, 2.93 Ghz)
QuoteGuga Algo 1 = 723718 miliseconds = 12.061966667 minutes
Guga Algo 1 = 719438 miliseconds = 11.990633333 minutes
Guga Algo 2 = 707640 miliseconds = 11.794 minutes
Hi Gustavo,
I don't understand what you mean with "wrong results". They do look different, but it's because your algo sorts unsigned values, while mine sorts signed DWORDs:
# signed unsigned hex (sorted by Guga)
0 3080 3080 00000C08
1 9158 9158 000023C6
2 13464 13464 00003498
3 15479 15479 00003C77
4 17700 17700 00004524
5 22867 22867 00005953
6 26355 26355 000066F3
..
999994 -33061 4294934235 FFFF7EDB
999995 -21344 4294945952 FFFFACA0
999996 -21243 4294946053 FFFFAD05
999997 -16107 4294951189 FFFFC115
999998 -8494 4294958802 FFFFDED2
999999 -6753 4294960543 FFFFE59F
# signed unsigned hex (sorted by ArraySort)
0 -2147482587 2147484709 80000425
1 -2147476825 2147490471 80001AA7
2 -2147476005 2147491291 80001DDB
3 -2147475509 2147491787 80001FCB
4 -2147466097 2147501199 8000448F
5 -2147453390 2147513906 80007632
6 -2147448628 2147518668 800088CC
..
999994 2147450671 2147450671 7FFF7F2F
999995 2147450920 2147450920 7FFF8028
999996 2147454024 2147454024 7FFF8C48
999997 2147454425 2147454425 7FFF8DD9
999998 2147476822 2147476822 7FFFE556
999999 2147481168 2147481168 7FFFF650
Re timings: ArraySort() takes 110 ms on my trusty old Celeron. If you provide your algo in Masm format, I'll try to put it into a testbed.
Hi JJ
I didn´t tested your algo yet, i used as a test the one from Masm lib (nrQsortA from qsorta.asm)
But, i´m getting confused, because on small chuncks nrQsortA takes more time to parse, according to my benchmarks tests.
Can you test my algo ? I´m not understanding what´s happenning, because on a real file (SortedDwords.dat), my algo takes absurd 12 minutes, while nrQsortA takes 94 ms ????
My algo in masm, should be like this, i believe:
Guga_Insort proc Array:DWORD, ArrayLen:DWORD
push edi
push ebx
mov ebx, 1
mov edi, Array
mov eax, ebx
sub eax, ArrayLen
jge OutFunction
MainLoop:
mov ecx, [edi+ebx*4]
InternalLoop:
mov eax, [edi+ebx*4-4]
cmp eax, ecx
jbe OutInternalLoop
mov [edi+ebx*4], eax
dec ebx
jnz InternalLoop
OutInternalLoop:
mov [edi+ebx*4], ecx
inc ebx
cmp ebx, ArrayLen
jb MainLoop
OutFunction:
pop ebx
pop edi
ret ; <--- Is Endp in masm already using a retn ??? If it does, then, this ret must be removed
Guga_Insort endp
I´m not sure about the ret instruction in masm. In RosAsm "endp" is a macro that already contains a retn. So, if ret is ok in my attemtp to port, keep it, otherwise, just remove the ret.
I disassembled my own algo, if it makes easy to port to Masm.
Guga_insort proc near
Array = dword ptr 8
ArrayLen = dword ptr 0Ch
push ebp
mov ebp, esp
push edi
push ebx
mov ebx, 1
mov edi, [ebp+Array]
mov eax, ebx
sub eax, [ebp+ArrayLen]
jge short loc_4069FA
loc_4069DC:
mov ecx, [edi+ebx*4]
loc_4069DF:
mov eax, [edi+ebx*4-4]
cmp eax, ecx
jbe short loc_4069ED
mov [edi+ebx*4], eax
dec ebx
jnz short loc_4069DF
loc_4069ED:
mov [edi+ebx*4], ecx
inc ebx
cmp ebx, [ebp+ArrayLen]
jb loc_4069DC
loc_4069FA:
pop ebx
pop edi
mov esp, ebp
pop ebp
retn 8
insort endp
Btw, here are some benchmarks i´m testing.
It seems that my algo (Insort optimized) is only faster for small chuncks of elements. The difference starts when there are more then 20 elements of the array. Up to that amount, nrQsortA is faster
For a accurated result. I inserted before each one of the algos, a function called "RandomGen", that copies to a buffer the array elements on the original order before the algos starts. This is to avoid that after, using the algo, the array is sorted, and the next algo will use this already sorted array and not a unsorted one.
Note: The file BenchMark_100Elements_NoGen.zip shows what happens when RandomGen is not being used. My algo parses 10 times faster when he array is already sorted, while nrQsortA tries to sort all over again, even if they are already sorted
Hi Gustavo,
I've added your algo to the testbed, but it seems there is quadratic behaviour present. Even when taking only 2% of the elements used by the other algos, it is slower, sorry :(
Guga_Insort proc uses edi ebx Array:DWORD, ArrayLen:DWORD
mov ebx, 1
mov edi, Array
mov eax, ebx
sub eax, ArrayLen
jge OutFunction
MainLoop:
mov ecx, [edi+ebx*4]
InternalLoop:
mov eax, [edi+ebx*4-4]
cmp eax, ecx
jbe OutInternalLoop
mov [edi+ebx*4], eax
dec ebx
jnz InternalLoop
OutInternalLoop:
mov [edi+ebx*4], ecx
inc ebx
cmp ebx, ArrayLen
jb MainLoop
OutFunction:
ret
Guga_Insort endp
Damn...it was too good to be true :icon_mrgreen:
I´ll give a test on dualPivotQuicksort to see the results.
Btw.does your algo (or another from masmlib) sorts unsigned integers ?
(...)Edited. JJ, If you read this post before i edited, forget it. I suceeded to make masmbasic work properly :)
I forgot to also rename Jwasm to mlv615.exe and not only renaming it with ml.exe :t
Hi Gustavo,
Just checked, it seems the minimum content of \Masm32\bin\ is this:
ml.exe ; can be 6.15 ... 10.0, or a renamed JWasm.exe (better)
link.exe ; the old 5.12 linker that comes with Masm32
mspdb50.dll ; the old linker needs this one
Guga,
At an algo level the only difference between a signed and unsigned compare is the jump you use after the compare.
Tks you for the feedback guys
I fixed the algo and mixed with other one i´building
The results are:
Intel(R) Core(TM) i7 CPU 870 @ 2.93GHz (MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2)
Sorting 2000000 items:
DwRange=-1
224 ms for ArraySort MasmBasic
202 ms for nrQsortA Masm32 lib
388 ms for crt_qsort with callback
176 ms for Gugasedgesort
DwRange=-1
232 ms for ArraySort MasmBasic
208 ms for nrQsortA Masm32 lib
390 ms for crt_qsort with callback
174 ms for Gugasedgesort
DwRange=-1
221 ms for ArraySort MasmBasic
202 ms for nrQsortA Masm32 lib
390 ms for crt_qsort with callback
171 ms for Gugasedgesort
ok
JJ, can you see if the masm syntax of the algorithm is correct ?
If my porting to masm are correct, it is 15% faster then nrQsortA. Probably i can optimize it a bit more if i succeed to remove the recursion and merge the functions i used. I´ll also finish the dualpivot (tri partition quicksort) to compare the results
I´m analyzing mine algos and see if i can reach something around 80-100 ms (Probably this timing can be reached if i suceed to make dualpivot algorithm work as expected and mix it with the other algos i built: insort and partial_quickersort)
Btw, I forgot
Steve, which/Where change the JCC instruction to make nrQsortA to sort unsigned ?
I would like to compare the timmings for the algos works on the same way (I mean, unsorted and sorted) to check for the differences
I´m almost there at 80 ms :icon_mrgreen:
Updated gugasedgesort
Can someone test the results for accuracy ?
Btw...JJ, if the results are accurated, then the 120 ms can be even more faster, once i remove the recursion. I estimate a gain of 30% (at least) of speed without recursion
Definitely fast...
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (MMX, SSE, SSE2, SSE3)
Sorting 2000000 items:
DwRange=-1
495 ms for ArraySort MasmBasic
457 ms for nrQsortA Masm32 lib
878 ms for crt_qsort with callback
248 ms for Gugasedgesort
DwRange=-1
499 ms for ArraySort MasmBasic
387 ms for nrQsortA Masm32 lib
820 ms for crt_qsort with callback
249 ms for Gugasedgesort
DwRange=-1
488 ms for ArraySort MasmBasic
388 ms for nrQsortA Masm32 lib
837 ms for crt_qsort with callback
264 ms for Gugasedgesort
Wahoo :icon_mrgreen: :icon_mrgreen:
Now, i´ll try to make it even faster.. I´m converting the dualpivot as i explained, and will later, remove the recursion on the gugasedgesort algo. I´m positive that gugasedgesort can be improved when removing the recursion and perhaps have a gain of 30% (or more) of it´s speed.
Also, once i made the tests on dualpivot, i´ll compare the results, and see if i can mix between gugasedgesort in order to force sedgesort to work with a tri partition technique. I hope i can make it work, because i´[m trying to achieve a speed of 40-80 ms on my I7 (Which means something around 120- 160 on your Celeron)
Btw, i´ll need to name the algo. "gugasedgesort" or GugaSomethign" is not exactly an algo´s name :icon_mrgreen:. I´m thinking in naming it as UltraSort or LightSpeed :icon_mrgreen:
For what it's worth, if I take the Microsoft qsort source that was distributed with my Server 2003 PSDK and hack it to sort 32-bit integers, the result is essentially twice as fast as the original (1021/511 on my P3, but only 125/78 on my Core-i3). The function is non-recursive, but does byte operations for everything (the explanation for this being to avoid alignment problems), and as you know calls an external comparison function from the inner loop (so it can be rigged to handle any data type). My modifications replaced the byte operations with integer operations, replaced the comparison function with a macro that is expanded inline, and did the same for the internal swap procedure.
I doubt that Microsoft would want me distributing their source file, hacked or otherwise, so I included only the object module in the attachment, along with my C test source and the EXE, compiled with the Visual C++ Toolkit 2003 compiler and /O2 /G6 optimizations.
MichaelW, M$ qsort code is distributed for free for research. Take a look at the Windows Research Kernel source code (If i remember well, it is in one of the M$ pages. You need to register in order to dl). Essentially what is inside msvcrt (client and server) can be way more improved as you did, but, still, are slower then the ones i´m testing/implementing.
Also, if i remember well, the source code for qsort uses cdecl calling convention, while a regular stdcall would be enough. If your modifications removed the recursion, there seems be no need for using cdecl. If you try to change to stdcall will it be a bit faster ? And you could, as well, remove the 5 alignments inside the code.
If i can be able to make this algos even faster, i can then, try to add a 3rd argument to be used as the "compare" argument of qsort, but, i´ll need to make all of this faster before give a try to add another argument. Using recursion, proved to be a bad way on those algos, since it slow down the speed, but, in some cases, we have no other way to go (well...at least, i didn´t touched the algo to remove the recursion yet)
Btw...can u test the benchmark i posted ? I would like to check the timmings on your P3
Gustavo,
the results from your last test program:
Intel(R) Xeon(R) CPU X3330 @ 2.66GHz (MMX, SSE, SSE2, SSE3, SSSE3, SS
E4.1)
Sorting 2000000 items:
DwRange=-1
392 ms for ArraySort MasmBasic
296 ms for nrQsortA Masm32 lib
517 ms for crt_qsort with callback
172 ms for Gugasedgesort
DwRange=-1
387 ms for ArraySort MasmBasic
312 ms for nrQsortA Masm32 lib
498 ms for crt_qsort with callback
154 ms for Gugasedgesort
DwRange=-1
386 ms for ArraySort MasmBasic
299 ms for nrQsortA Masm32 lib
553 ms for crt_qsort with callback
194 ms for Gugasedgesort
ok
Gunther
Quote from: guga on October 01, 2014, 11:52:41 AM
Btw...can u test the benchmark i posted ? I would like to check the timmings on your P3
These benchmarks?
10
Results 8 pass average
timing nrQsortA (Masm Lib) 26150 ms
timing sedgesort Guga6995 ms
timing insort Guga6316 ms
20
Results 8 pass average
timing nrQsortA (Masm Lib) 41865 ms
timing sedgesort Guga24775 ms
timing insort Guga24013 ms
30
Results 8 pass average
timing nrQsortA (Masm Lib) 84504 ms
timing sedgesort Guga55358 ms
timing insort Guga54557 ms
52
Results 8 pass average
timing nrQsortA (Masm Lib) 134079 ms
timing sedgesort Guga149397 ms
timing insort Guga146562 ms
100
Results 8 pass average
timing nrQsortA (Masm Lib) 289919 ms
timing sedgesort Guga472609 ms
timing insort Guga471819 ms
100 NoGen
Results 8 pass average
timing nrQsortA (Masm Lib) 143974 ms
timing sedgesort Guga9573 ms
timing insort Guga8752 ms
The above took hours on a 500MHz P3.
Tks guys :)
Btw...MichaelW, Did you tested the last one (ArraySortTimingsDword6a.zip ) ? For your results, i guess you tested the version that had problems with quadratic behaviour. Do´t use the ones from "NoGen". Those are the ones that had errors.
I fixed this on the last version (ArraySortTimingsDword6a.zip)
Can u test on a P3 this version below that i modified for your P3 ? I included only my algo to make the benchmark run faster. I would like to compare the results on a old P3.
ArraySortTimingsDword6a.zip and ArraySortTimingsDword6c.exe, both apparently using MasmBasic, require SSE2 support so they won't run on a P3.
deleted
Hi nidud,
the results:
41 18467 6334 26500 19169 15724 11478 29358 26962 24464
5705 28145 23281 16827 9961 491 2995 11942 4827 5436
qsort:
41 491 2995 4827 5436 5705 6334 9961 11478 11942
15724 16827 18467 19169 23281 24464 26500 26962 28145 29358
qsortmod:
41 491 2995 4827 5436 5705 6334 9961 11478 11942
15724 16827 18467 19169 23281 24464 26500 26962 28145 29358
sortdd:
41 491 2995 4827 5436 5705 6334 9961 11478 11942
15724 16827 18467 19169 23281 24464 26500 26962 28145 29358
qsort: 3100
qsortmod: 540
sortdd: 480
Gunther
There is an alternative to exchange based sorts if you know the range from lowest to highest value and it can be fitted into memory. They were long ago called "letter box" sorts and I have heard them called "pidgeon hole" sorts but the technique is much the same. If just for example you were working on a numeric range of 1 million, you allocate an array of 4 million bytes zero filled then scan through the data incrementing each array member that matches the number scanned as it is found.
Once the first scan is completed you then scan the target array and for non zero members and write them back to the original source sorted. If the range you are working on is limited, this technique will kill an exchange sort dead, they are so much faster.
Quote from: hutch-- on October 03, 2014, 11:01:12 AM
There is an alternative to exchange based sorts if you know the range from lowest to highest value and it can be fitted into memory. They were long ago called "letter box" sorts and I have heard them called "pidgeon hole" sorts ... If the range you are working on is limited, this technique will kill an exchange sort dead, they are so much faster.
Kind of bucket sorts, yes. My ArraySort() (http://www.webalice.it/jj2006/MasmBasicQuickReference.htm#Mb1176) uses a variant called radix sort, which has no memory problems and beats nrQsortA up to a range of 400,000 or so (and crt qsort has no chance even for the full dword range).
Re the timings above: They are so sensational that you should get a patent for "sorting small arrays repeatedly". But check the timings for "sorting a fat array once" ;)
Jochen,
Quote from: jj2007 on October 03, 2014, 02:30:03 PM
They are so sensational that you should get a patent for "sorting small arrays repeatedly". But check the timings for "sorting a fat array once" ;)
these are very different things.
Gunther
Yes indeed. How is the weather in Berlin?
Jochen,
Quote from: jj2007 on October 03, 2014, 10:02:35 PM
Yes indeed. How is the weather in Berlin?
the weather is fine here: sunny and a warm wind. Good weather for a public holiday.
But back to the thread. Qsort needs in any case a good center for the data. That's necessary for such Divide et impera (divide and conquer) algorithm. Sedgwick has an own chapter for that point.
Gunther