News:

Masm32 SDK description, downloads and other helpful links
Message to All Guests

Main Menu

Create a file with RANDOM dwords

Started by jj2007, September 29, 2014, 07:48:21 AM

Previous topic - Next topic

jj2007

Definitely fast...

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (MMX, SSE, SSE2, SSE3)
Sorting 2000000 items:
DwRange=-1
495 ms for ArraySort    MasmBasic
457 ms for nrQsortA     Masm32 lib
878 ms for crt_qsort    with callback
248 ms for Gugasedgesort
DwRange=-1
499 ms for ArraySort    MasmBasic
387 ms for nrQsortA     Masm32 lib
820 ms for crt_qsort    with callback
249 ms for Gugasedgesort
DwRange=-1
488 ms for ArraySort    MasmBasic
388 ms for nrQsortA     Masm32 lib
837 ms for crt_qsort    with callback
264 ms for Gugasedgesort

guga

Wahoo  :icon_mrgreen: :icon_mrgreen:

Now, i´ll try to make it even faster.. I´m converting the dualpivot as i explained, and will later, remove the recursion on the gugasedgesort algo. I´m positive that gugasedgesort can be improved when removing the recursion and perhaps have a gain of 30% (or more) of it´s speed.

Also, once i made the tests on dualpivot, i´ll compare the results, and see if i can mix between gugasedgesort in order to force sedgesort to work with a tri partition technique. I hope i can make it work, because i´[m trying to achieve a speed of 40-80 ms on my I7 (Which means something around 120- 160 on your Celeron)

Btw, i´ll need to name the algo. "gugasedgesort" or GugaSomethign" is not exactly an algo´s name :icon_mrgreen:. I´m thinking in naming it as UltraSort or LightSpeed  :icon_mrgreen:
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

MichaelW

For what it's worth, if I take the Microsoft qsort source that was distributed with my Server 2003 PSDK and hack it to sort 32-bit integers, the result is essentially twice as fast as the original (1021/511 on my P3, but only 125/78 on my Core-i3). The function is non-recursive, but does byte operations for everything (the explanation for this being to avoid alignment problems), and as you know calls an external comparison function from the inner loop (so it can be rigged to handle any data type). My modifications replaced the byte operations with integer operations, replaced the comparison function with a macro that is expanded inline, and did the same for the internal swap procedure.

I doubt that Microsoft would want me distributing their source file, hacked or otherwise, so I included only the object module in the attachment, along with my C test source and the EXE, compiled with the Visual C++ Toolkit 2003 compiler and /O2 /G6 optimizations.
Well Microsoft, here's another nice mess you've gotten us into.

guga

 MichaelW, M$ qsort code is distributed for free for research. Take a look at the Windows Research Kernel source code (If i remember well, it is in one of the M$ pages. You need to register in order to dl). Essentially what is inside msvcrt (client and server) can be way more improved as you did, but, still, are slower then the ones i´m testing/implementing.

Also, if i remember well, the source code for qsort uses cdecl calling convention, while a regular stdcall would be enough. If your modifications removed the recursion, there seems be no need for using cdecl. If you try to change to stdcall will it be a bit faster ? And you could, as well, remove the 5 alignments inside the code.

If i can be able to make this algos even faster, i can then, try to add a 3rd  argument to be used as the "compare" argument of qsort, but, i´ll need to make all of this faster before give a try to add another argument. Using recursion, proved to be a bad way on those algos, since it slow down the speed, but, in some cases, we have no other way to go (well...at least, i didn´t touched the algo to remove the recursion yet)

Btw...can u test the benchmark i posted ? I would like to check the timmings on your P3
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

Gunther

Gustavo,

the results from your last test program:

Intel(R) Xeon(R) CPU           X3330  @ 2.66GHz (MMX, SSE, SSE2, SSE3, SSSE3, SS
E4.1)
Sorting 2000000 items:
DwRange=-1
392 ms for ArraySort    MasmBasic
296 ms for nrQsortA     Masm32 lib
517 ms for crt_qsort    with callback
172 ms for Gugasedgesort
DwRange=-1
387 ms for ArraySort    MasmBasic
312 ms for nrQsortA     Masm32 lib
498 ms for crt_qsort    with callback
154 ms for Gugasedgesort
DwRange=-1
386 ms for ArraySort    MasmBasic
299 ms for nrQsortA     Masm32 lib
553 ms for crt_qsort    with callback
194 ms for Gugasedgesort
ok


Gunther
You have to know the facts before you can distort them.

MichaelW

Quote from: guga on October 01, 2014, 11:52:41 AM
Btw...can u test the benchmark i posted ? I would like to check the timmings on your P3

These benchmarks?

10
Results 8 pass average
timing nrQsortA (Masm Lib) 26150 ms
timing sedgesort Guga6995 ms
timing insort Guga6316 ms
20
Results 8 pass average
timing nrQsortA (Masm Lib) 41865 ms
timing sedgesort Guga24775 ms
timing insort Guga24013 ms
30
Results 8 pass average
timing nrQsortA (Masm Lib) 84504 ms
timing sedgesort Guga55358 ms
timing insort Guga54557 ms
52
Results 8 pass average
timing nrQsortA (Masm Lib) 134079 ms
timing sedgesort Guga149397 ms
timing insort Guga146562 ms
100
Results 8 pass average
timing nrQsortA (Masm Lib) 289919 ms
timing sedgesort Guga472609 ms
timing insort Guga471819 ms
100 NoGen
Results 8 pass average
timing nrQsortA (Masm Lib) 143974 ms
timing sedgesort Guga9573 ms
timing insort Guga8752 ms

The above took hours on a 500MHz P3.
Well Microsoft, here's another nice mess you've gotten us into.

guga

Tks guys :)


Btw...MichaelW, Did you tested the last one (ArraySortTimingsDword6a.zip ) ? For your results, i guess you tested the version that had problems with quadratic behaviour. Do´t use the ones from "NoGen". Those are the ones that had errors.

I fixed this on the last version (ArraySortTimingsDword6a.zip)

Can u test on a P3 this version below that i modified for your P3 ? I included only my algo to make the benchmark run faster. I would like to compare the results on a old P3.
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

MichaelW

ArraySortTimingsDword6a.zip and ArraySortTimingsDword6c.exe, both apparently using MasmBasic, require SSE2 support so they won't run on a P3.
Well Microsoft, here's another nice mess you've gotten us into.

nidud

#23
deleted

Gunther

Hi nidud,

the results:

41      18467   6334    26500   19169   15724   11478   29358   26962   24464
5705    28145   23281   16827   9961    491     2995    11942   4827    5436


qsort:
41      491     2995    4827    5436    5705    6334    9961    11478   11942
15724   16827   18467   19169   23281   24464   26500   26962   28145   29358

qsortmod:
41      491     2995    4827    5436    5705    6334    9961    11478   11942
15724   16827   18467   19169   23281   24464   26500   26962   28145   29358

sortdd:
41      491     2995    4827    5436    5705    6334    9961    11478   11942
15724   16827   18467   19169   23281   24464   26500   26962   28145   29358

qsort:          3100
qsortmod:       540
sortdd:         480


Gunther
You have to know the facts before you can distort them.

hutch--

There is an alternative to exchange based sorts if you know the range from lowest to highest value and it can be fitted into memory. They were long ago called "letter box" sorts and I have heard them called "pidgeon hole" sorts but the technique is much the same. If just for example you were working on a numeric range of 1 million, you allocate an array of 4 million bytes zero filled then scan through the data incrementing each array member that matches the number scanned as it is found.

Once the first scan is completed you then scan the target array and for non zero members and write them back to the original source sorted. If the range you are working on is limited, this technique will kill an exchange sort dead, they are so much faster.

jj2007

Quote from: hutch-- on October 03, 2014, 11:01:12 AM
There is an alternative to exchange based sorts if you know the range from lowest to highest value and it can be fitted into memory. They were long ago called "letter box" sorts and I have heard them called "pidgeon hole" sorts ...  If the range you are working on is limited, this technique will kill an exchange sort dead, they are so much faster.

Kind of bucket sorts, yes. My ArraySort() uses a variant called radix sort, which has no memory problems and beats nrQsortA up to a range of 400,000 or so (and crt qsort has no chance even for the full dword range).

Re the timings above: They are so sensational that you should get a patent for "sorting small arrays repeatedly". But check the timings for "sorting a fat array once" ;)

Gunther

Jochen,

Quote from: jj2007 on October 03, 2014, 02:30:03 PM
They are so sensational that you should get a patent for "sorting small arrays repeatedly". But check the timings for "sorting a fat array once" ;)

these are very different things.

Gunther
You have to know the facts before you can distort them.

jj2007

Yes indeed. How is the weather in Berlin?

Gunther

Jochen,

Quote from: jj2007 on October 03, 2014, 10:02:35 PM
Yes indeed. How is the weather in Berlin?

the weather is fine here: sunny and a warm wind. Good weather for a public holiday.

But back to the thread. Qsort needs in any case a good center for the data. That's necessary for such Divide et impera (divide and conquer) algorithm. Sedgwick has an own chapter for that point.

Gunther
You have to know the facts before you can distort them.