News:

Masm32 SDK description, downloads and other helpful links
Message to All Guests
NB: Posting URL's See here: Posted URL Change

Main Menu

Timings for fast median algo

Started by jj2007, July 24, 2020, 12:30:27 AM

Previous topic - Next topic

jj2007

Quote from: Siekmanski on July 25, 2020, 08:04:02 PM
My routine is not to blame, must be something else.  :cool:

You do it without normalisation: mov eax, (9488645+1) * 4, and that works :sad:

It's not a blame game, Marinus, I am just trying to understand why my version misbehaves for that specific case. Usually the deviation is in the 0.02% range, here it is 3%, a factor 150 higher...

Most probably it has to with resolution. The range here is 1332...9488645, and the real median is 38532, with is 0.4% of the max value. The distribution looks a bit extreme, see below. If I increase the buffer to 400,000 bytes, all is fine again.
179     bytes for ArrInf1
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz
Precision loss is relative to the 'real' median, which is obtained by sorting the DWORDs

11 numbers generated
0 ms to find the median of 11 items, which is 31328.00
0 ms to find the median of 11 items, which is 31328.00
0 ms to find the median of 11 items, which is 31328.00
0 ms for sorting, the real median is 31349.00
Speed gain is a factor 0.0245, precision loss is -0.0670%

110 numbers generated
0 ms to find the median of 110 items, which is 38522.00
0 ms to find the median of 110 items, which is 38522.00
0 ms to find the median of 110 items, which is 38522.00
0 ms for sorting, the real median is 38532.00
Speed gain is a factor 0.457, precision loss is -0.0260%

1000001 numbers generated
5 ms to find the median of 1 Mio items, which is 39751.00
5 ms to find the median of 1 Mio items, which is 39751.00
5 ms to find the median of 1 Mio items, which is 39751.00
100 ms for sorting, the real median is 39792.00
Speed gain is a factor 17.1, precision loss is -0.103%

10000010 numbers generated
60 ms to find the median of 10 Mio items, which is 39825.00
57 ms to find the median of 10 Mio items, which is 39825.00
57 ms to find the median of 10 Mio items, which is 39825.00
1048 ms for sorting, the real median is 39796.00
Speed gain is a factor 17.9, precision loss is 0.0729%

100000101 numbers generated
615 ms to find the median of 100 Mio items, which is 39820.00
587 ms to find the median of 100 Mio items, which is 39820.00
591 ms to find the median of 100 Mio items, which is 39820.00
9960 ms for sorting, the real median is 39796.00
Speed gain is a factor 16.7, precision loss is 0.0603%

Siekmanski

That wasn't my intention, probably misunderstood the situation.
Sorry, if I gave you that feeling.

I just allocated the minimum buffer size for the data set.
What do you mean exactly with the normalisation?
Creative coders use backward thinking techniques as a strategy.

jj2007

No bad feeling here, Marinus :biggrin:

What I am doing here is to adapt the algo to any range of values by normalising it:
aiPop:
  if aiInt
fild aiSize$ PTR [esi+aiSize*ecx] ; get original integer
  else
fld aiSize$ PTR [esi+aiSize*ecx] ; get original real
  endif
  fsub arrInf.yMin ; make positive
  fmul arrInf.yRange ; normalise 0...99999
  fistp arrInf.yIndex ; we need a
  mov eax, arrInf.yIndex ; reg32
  inc dword ptr [edi+4*eax] ; Populate the Median buffer with the values as index numbers and keep track for duplicates in one go
  dec ecx
  jns aiPop


That works fine, but (see image above) for very skewed distributions the deviation can be significant.

Results for the latest version (attached):
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz
Precision loss is relative to the 'real' median, which is obtained by sorting the DWORDs

11 numbers generated
0 ms to find the median of 11 items, which is 31328.00 (min=8260, max=5746454)
0 ms to find the median of 11 items, which is 31328.00 (min=8260, max=5746454)
0 ms to find the median of 11 items, which is 31328.00 (min=8260, max=5746454)
0 ms for sorting, the real median is 31349.00
Speed gain is a factor 0.0149, precision loss is -0.0670%

110 numbers generated
0 ms to find the median of 110 items, which is 38522.00 (min=1332, max=9488645)
0 ms to find the median of 110 items, which is 38522.00 (min=1332, max=9488645)
0 ms to find the median of 110 items, which is 38522.00 (min=1332, max=9488645)
0 ms for sorting, the real median is 38532.00
Speed gain is a factor 0.444, precision loss is -0.0260%

1000001 numbers generated
8 ms to find the median of 1 Mio items, which is 39751.00 (min=-12313, max=9999999)
7 ms to find the median of 1 Mio items, which is 39751.00 (min=-12313, max=9999999)
8 ms to find the median of 1 Mio items, which is 39751.00 (min=-12313, max=9999999)
112 ms for sorting, the real median is 39792.00
Speed gain is a factor 13.4, precision loss is -0.103%

10000010 numbers generated
83 ms to find the median of 10 Mio items, which is 39825.00 (min=-12340, max=9999995)
87 ms to find the median of 10 Mio items, which is 39825.00 (min=-12340, max=9999995)
84 ms to find the median of 10 Mio items, which is 39825.00 (min=-12340, max=9999995)
1146 ms for sorting, the real median is 39796.00
Speed gain is a factor 13.5, precision loss is 0.0729%

100000101 numbers generated
848 ms to find the median of 100 Mio items, which is 39820.00 (min=-12345, max=9999999)
842 ms to find the median of 100 Mio items, which is 39820.00 (min=-12345, max=9999999)
841 ms to find the median of 100 Mio items, which is 39820.00 (min=-12345, max=9999999)
10829 ms for sorting, the real median is 39796.00
Speed gain is a factor 12.8, precision loss is 0.0603%

Siekmanski

 :thumbsup:

Hi Jochen,

Now I understand what your goal is.
What about this idea to speed it up multiple times.
Multithreading.
Divide the data set in equal parts ( / max cpu thread count )
1 Median Buffer for each thread, when done adding the buffers together per index number and find the median, will speed it up.
Creative coders use backward thinking techniques as a strategy.

jj2007

Well.... at 842 ms to find the median of 100 Mio items, I am not eager to dive into the secrecies of multithreaded programs ;-)

Google How do you find the median of a large data set? to see this:
QuoteOrder the values from smallest to largest.
If the data set contains an odd number of values , choose the one that is exactly in the middle. You've found the median.
If the data set contains an even number of values , take the two values that appear in the middle and average them to find the median.

So they think sorting is the way to go, which implies that their idea of a "large" data set is different from ours :tongue:

Siekmanski

Sometimes, thinking outside of the box can give you a reward.
Most ideas are based on other ideas without understanding why the first idea was constructed that way.
Many ancient fast routines were designed to work for small memory sizes and slow memory reads and writes. Not a problem anymore.

I came up with this idea when I wrote a routine to find the unique number count of colors in a bitmap image.
When guga needed a fast median routine, I knew this would be fast, especially in a small range ( 0-255 ) and millions of colors.

Later it turned out to be for only 100 colors per iteration.
Creative coders use backward thinking techniques as a strategy.

jj2007

This is the goal :cool:
Dim MyArray() As REAL8   ; or DWORD, REAL4, ...
... fill it with values ...
Print Str$("The array has %i elements\n", MyArray(?))
Print Str$("The median is %f (a proxy, typically 0.02% off) \n", MyArray(median?))
Print Str$("The average is %f\n", MyArray(average?))
Print Str$("The minimum is %f\n", MyArray(min?))
Print Str$("The maximum is %f\n", MyArray(max?))

daydreamer

I think multithreading could be a good idea

maybe would be useful for Guga to have extra thread dedicated to filereading/filewriting in background,instead of main thread switches between fast algos running median/image/videoframe processing and to slow write processed images/data to disk when RAM is filled and readin unprocessed images/data ,to start the main loop again
my none asm creations
https://masm32.com/board/index.php?topic=6937.msg74303#msg74303
I am an Invoker
"An Invoker is a mage who specializes in the manipulation of raw and elemental energies."
Like SIMD coding

guga

Multi-thread could be a good way to go, but 1st we need to see if the watermark remover algo will work as expected (i presume it will, but we didn´t get there yet :greensml: :greensml: :greensml:)

Btw....Someone has the timming of quicksort algo ? I would like to compare it with one thing i´m finishing here and will post soon :azn: :azn: :azn:.
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

Siekmanski

In the MASM  lib you can find Qsort routines.
Creative coders use backward thinking techniques as a strategy.


guga

HI Marinus...I know, but i wanted just to compare the results of qsort and another sorting algo called Bitonic sort. I opened a thread and posted the source (in masm too) to benchmark later.

Btw... How do i exchange the order of 2 floats in SSE from different registers ? Ex:

xmm1 = 1, 2, 3, 4
xmm2 = 5, 6, 7, 8

How do i exchange the order of the low and hi qword on both of them at once in a way that result in:

xmm1 = 1, 2, 5, 6
xmm2 = 3, 4, 6, 7
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

guga

Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

LiaoMi

Quote from: jj2007 on July 25, 2020, 09:14:41 PM
No bad feeling here, Marinus :biggrin:

What I am doing here is to adapt the algo to any range of values by normalising it:
aiPop:
  if aiInt
fild aiSize$ PTR [esi+aiSize*ecx] ; get original integer
  else
fld aiSize$ PTR [esi+aiSize*ecx] ; get original real
  endif
  fsub arrInf.yMin ; make positive
  fmul arrInf.yRange ; normalise 0...99999
  fistp arrInf.yIndex ; we need a
  mov eax, arrInf.yIndex ; reg32
  inc dword ptr [edi+4*eax] ; Populate the Median buffer with the values as index numbers and keep track for duplicates in one go
  dec ecx
  jns aiPop


That works fine, but (see image above) for very skewed distributions the deviation can be significant.

Results for the latest version (attached):
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz
Precision loss is relative to the 'real' median, which is obtained by sorting the DWORDs

11 numbers generated
0 ms to find the median of 11 items, which is 31328.00 (min=8260, max=5746454)
0 ms to find the median of 11 items, which is 31328.00 (min=8260, max=5746454)
0 ms to find the median of 11 items, which is 31328.00 (min=8260, max=5746454)
0 ms for sorting, the real median is 31349.00
Speed gain is a factor 0.0149, precision loss is -0.0670%

110 numbers generated
0 ms to find the median of 110 items, which is 38522.00 (min=1332, max=9488645)
0 ms to find the median of 110 items, which is 38522.00 (min=1332, max=9488645)
0 ms to find the median of 110 items, which is 38522.00 (min=1332, max=9488645)
0 ms for sorting, the real median is 38532.00
Speed gain is a factor 0.444, precision loss is -0.0260%

1000001 numbers generated
8 ms to find the median of 1 Mio items, which is 39751.00 (min=-12313, max=9999999)
7 ms to find the median of 1 Mio items, which is 39751.00 (min=-12313, max=9999999)
8 ms to find the median of 1 Mio items, which is 39751.00 (min=-12313, max=9999999)
112 ms for sorting, the real median is 39792.00
Speed gain is a factor 13.4, precision loss is -0.103%

10000010 numbers generated
83 ms to find the median of 10 Mio items, which is 39825.00 (min=-12340, max=9999995)
87 ms to find the median of 10 Mio items, which is 39825.00 (min=-12340, max=9999995)
84 ms to find the median of 10 Mio items, which is 39825.00 (min=-12340, max=9999995)
1146 ms for sorting, the real median is 39796.00
Speed gain is a factor 13.5, precision loss is 0.0729%

100000101 numbers generated
848 ms to find the median of 100 Mio items, which is 39820.00 (min=-12345, max=9999999)
842 ms to find the median of 100 Mio items, which is 39820.00 (min=-12345, max=9999999)
841 ms to find the median of 100 Mio items, which is 39820.00 (min=-12345, max=9999999)
10829 ms for sorting, the real median is 39796.00
Speed gain is a factor 12.8, precision loss is 0.0603%


179     bytes for ArrInf1
Intel(R) Core(TM) i7-4810MQ CPU @ 2.80GHz
Precision loss is relative to the 'real' median, which is obtained by sorting the DWORDs

11 numbers generated
0 ms to find the median of 11 items, which is 31328.00 (min=8260, max=5746454)
0 ms to find the median of 11 items, which is 31328.00 (min=8260, max=5746454)
0 ms to find the median of 11 items, which is 31328.00 (min=8260, max=5746454)
0 ms for sorting, the real median is 31349.00
Speed gain is a factor 0.0166, precision loss is -0.0670%

110 numbers generated
0 ms to find the median of 110 items, which is 38522.00 (min=1332, max=9488645)
0 ms to find the median of 110 items, which is 38522.00 (min=1332, max=9488645)
0 ms to find the median of 110 items, which is 38522.00 (min=1332, max=9488645)
0 ms for sorting, the real median is 38532.00
Speed gain is a factor 0.488, precision loss is -0.0260%

1000001 numbers generated
7 ms to find the median of 1 Mio items, which is 39751.00 (min=-12313, max=9999999)
6 ms to find the median of 1 Mio items, which is 39751.00 (min=-12313, max=9999999)
6 ms to find the median of 1 Mio items, which is 39751.00 (min=-12313, max=9999999)
91 ms for sorting, the real median is 39792.00
Speed gain is a factor 14.1, precision loss is -0.103%

10000010 numbers generated
70 ms to find the median of 10 Mio items, which is 39825.00 (min=-12340, max=9999995)
72 ms to find the median of 10 Mio items, which is 39825.00 (min=-12340, max=9999995)
72 ms to find the median of 10 Mio items, which is 39825.00 (min=-12340, max=9999995)
955 ms for sorting, the real median is 39796.00
Speed gain is a factor 13.3, precision loss is 0.0729%

100000101 numbers generated
712 ms to find the median of 100 Mio items, which is 39820.00 (min=-12345, max=9999999)
678 ms to find the median of 100 Mio items, which is 39820.00 (min=-12345, max=9999999)
686 ms to find the median of 100 Mio items, which is 39820.00 (min=-12345, max=9999999)
8931 ms for sorting, the real median is 39796.00
Speed gain is a factor 12.9, precision loss is 0.0603%
hit any key 

Siekmanski

Quote from: guga on July 26, 2020, 06:43:46 PM
HI Marinus...I know, but i wanted just to compare the results of qsort and another sorting algo called Bitonic sort. I opened a thread and posted the source (in masm too) to benchmark later.

Btw... How do i exchange the order of 2 floats in SSE from different registers ? Ex:

xmm1 = 1, 2, 3, 4
xmm2 = 5, 6, 7, 8

How do i exchange the order of the low and hi qword on both of them at once in a way that result in:

xmm1 = 1, 2, 5, 6
xmm2 = 3, 4, 6, 7

With one or more of these beauties,

movlps movhps
shufps pshufd
unpcklps unpckhps

Play with them, once you understand how they work, they will be your best friends forever.  :eusa_dance:
Creative coders use backward thinking techniques as a strategy.