Author Topic: Fast Compare Real8 with SSE and ColorSpaces  (Read 6922 times)

guga

  • Member
  • *****
  • Posts: 1043
  • Assembly is a state of art.
    • RosAsm
Fast Compare Real8 with SSE and ColorSpaces
« on: January 29, 2019, 02:31:06 PM »
Hi Guys

Somebody suceeded to create a faster way to find a Real8 value among a table of 256 ?

The problems is like thiss:

given a table of 256 Real8 values from 0 to 100 (And all in crescent order from 0 to 100 only)

Btw: R$ stands for Real8 notation ;)

[MyTable:
 Data0: R$  0 ; Index = 0
 Data1: R$ 4.89e-2 ; Index = 1
 Data2: R$ 9.79e-2 ; Index = 2
 Data3: R$ 1.46e-1 ; Index = 3
 Data4: R$ 1.95e-1 ; Index = 4
(...)
 Data78: R$ 9.44
 Data79: R$ 9.67
(...)
 Data150: R$ 33.6 ; Index = 150
 Data151: R$ 34.13 ; Index = 151
(...)
 Data254: R$ 99.18 ; Index = 254
 Data255: R$ 100] ; Index = 255

MyValue = R$ 78.5212


The goal is to find the value "78.5212" among the 256 Real8 Array (In crescent order) and returns the corresponding index of it.

I made a routine for that, but it is kinda slow, though :redface: :redface: :redface: Not sure how to use SSE to make it faster

Code: [Select]

    ; ebx points to the end of the array. So points to "Data255"

    Fpu_If R@MyValue >= R$ebx
        mov eax 255 | jmp L1> ; Found value on the last array (the 255th one). Set it´s index to eax
    Fpu_End_If

    mov ecx 254
    Do

        Fpu_If_And R@MyValue < R$ebx, R@MyValue >= R$ebx-8   ;If value was found in between the current and the previous, we have a match
            mov eax ecx | jmp L1> ; Found value and set it´s index to eax
        Fpu_End_If
        sub ebx 8
        dec ecx
    Loop_Until ecx <= 0
L1:

« Last Edit: March 10, 2019, 04:36:53 AM by guga »
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

jj2007

  • Member
  • *****
  • Posts: 9687
  • Assembler is fun ;-)
    • MasmBasic
Re: Fast Compare Real8 with SSE
« Reply #1 on: January 29, 2019, 03:24:21 PM »
Option 0 below is probably the fastest solution (project attached):

include \masm32\MasmBasic\MasmBasic.inc         ; download

table REAL8 1234567890.1, 1234567890.2, 1234567890.3, 1234567890.4, 1234567890.5, 1234567890.6

  Init
  fld FP8(1234567890.5)   ; we want to see Match #5
  sub esp, REAL8
  fstp REAL8 ptr [esp]
  movlps xmm1, REAL8 ptr [esp]
  mov esi, offset table
  xor ecx, ecx
  if 1  ; 1=use Fcmp, 0=use pcmpeqb & friends
       .Repeat
                movlps xmm0, [esi+REAL8*ecx]
                inc ecx
                Fcmp xmm0, xmm1, high   ; high = be reasonably strict; try top or medium
       .Until Zero? || ecx>5
  else
        .Repeat
                movlps xmm0, [esi+REAL8*ecx]
                pcmpeqb xmm0, xmm1
                pmovmskb eax, xmm0
                inc ecx
        .Until ecx>5 || ax==0FFFFh
  endif
  .if Zero?
        Inkey Str$("Match at index %i", ecx)
  .else
        Inkey "No match, sorry"
  .endif
EndOfCode


Option 1, Fcmp, is a bit slower, but the problem here is that the fast solution requires an exact match. Real world examples rarely have such exact matches. What do you need it for?

guga

  • Member
  • *****
  • Posts: 1043
  • Assembly is a state of art.
    • RosAsm
Re: Fast Compare Real8 with SSE
« Reply #2 on: January 29, 2019, 04:26:57 PM »
Tks a lot, Jochen :t :t :t

I´ll give it a try It´s for a two functions i´m making to convert from RGB to CieLCH (and vice-versa). I´m also studying how you managed to make a faster sin and cosine function. So far, the functions i built are fast for image processing, but still not fast enough for video. With the help of those optimizations i hope the functions can be improved. If i succeed to make it work as expected i´ll build a library and post it here for you :)

I was a bit surprised about the power of the CieLCH conversions and it´s precision and started to give a try on then :) I was amazed whern found out after doing the maths envolved with those colorspaces that, in fact we can get rid of tons of useless colors combinations that are, in fact, indistinguishable from each other. I already succeeded to reduce the luminance values to fit´s to the range of 0 to 255 and it seems that for chroma and hue it has some restrictions as well. It seems that chrom and hue are related to each other on every luminance value from a table. So it would be much easier to perform the conversions and, at the same time, make the general aspect of a image (or video) be more consistent to what the CieLCH colorspace is actually doing.
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

guga

  • Member
  • *****
  • Posts: 1043
  • Assembly is a state of art.
    • RosAsm
Re: Fast Compare Real8 with SSE
« Reply #3 on: January 29, 2019, 11:05:18 PM »
Hi JJ

I tried the pcmpeqb xmm0, xmm1 but it´s not working. It´s oly exiting the loop when both values are exactly the  same.

The goal is to exit the loop when the value in MyTable is bigger or equal to the inputed value "1234567890.5"

For instance, if the difference between them are more then 8 digits, the routine dn´t exit the loop and don´t find the number.

I also tried with pcmpgtd/movlpd but no sucess. I did this:

Code: [Select]
   mov ebx MYTable

    movlpd xmm1 X@ValuetoFind
    xor ecx ecx
    Do
        movlpd XMM0 X$ebx+ecx*8
        pcmpgtd XMM0 XMM1
        pmovmskb eax XMM0
        If eax <> 0
            dec ecx | jmp L1>
        End_If
        inc ecx
    Loop_Until ecx > 255

Ex:
[MyTable: R$ 0 ; index1
2 ; index2
4 ; index3
6 ; index 4
8] ; index 5

MyNumber = 3

Then it will exit the loop when the value was found  in Between 2 and 4. In, short, when it reaches "4" it will exit the loop, since 4 is bigger then 3 (Also needs to work for equal values as well and not only bigger then)

A question, why comparing also ax with -1 ?
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

hutch--

  • Administrator
  • Member
  • ******
  • Posts: 6660
  • Mnemonic Driven API Grinder
    • The MASM32 SDK
Re: Fast Compare Real8 with SSE
« Reply #4 on: January 30, 2019, 01:44:10 AM »
Guga,

See if "comisd" is more suited to the comparison you need.
hutch at movsd dot com
http://www.masm32.com    :biggrin:  :skrewy:

daydreamer

  • Member
  • ****
  • Posts: 907
  • watch Chebyshev on the backside of the Moon
Re: Fast Compare Real8 with SSE
« Reply #5 on: January 30, 2019, 03:18:42 AM »
Hutch is right comissd for conditional jump
Organize in Search tree maybe faster?
Quote from Flashdance
Nick  :  When you give up your dream, you die
*wears a flameproof asbestos suit*
Gone serverside programming p:  :D

jj2007

  • Member
  • *****
  • Posts: 9687
  • Assembler is fun ;-)
    • MasmBasic
Re: Fast Compare Real8 with SSE
« Reply #6 on: January 30, 2019, 03:37:13 AM »
COMISD works but it is exactly as fast as the pcmpeqb/pmovmskb combi. Besides, the problem remains that equality means for both variants 53 identical bits in the mantissa. If the matches are guaranteed to be exactly equal at REAL8 level, fine, otherwise you need to allow for a delta as in Fcmp.

Below results for a 50 Million elements array of random doubles in the range -10.0 ... +10.0
The first run has an exact match at the end of the array.
For the second run, the search value was increased by a tiny amount.

The issue is tricky and requires a good task definition. My impression is that working with integers would be a wiser strategy.

Code: [Select]
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz
last elements of double array:
49999999        -1.09845799859612
50000000        -6.69932428747415
50000001        0.0

Now trying to find a match for -6.699324287474150097
Finding match at pos 50000000 (-6.699324287474150097) took 825 ms using Fcmp
Finding match at pos 50000000 (-6.699324287474150097) took 70 ms using pcmpeqb
Finding match at pos 50000000 (-6.699324287474150097) took 70 ms using comisd

Now trying to find a match for -6.698733009397984439
Finding match at pos 20284227 (-6.698732967488462364) took 346 ms using Fcmp
No match using pcmpeqb, sorry
No match using comisd, sorry

AW

  • Member
  • *****
  • Posts: 2318
  • Let's Make ASM Great Again!
Re: Fast Compare Real8 with SSE
« Reply #7 on: January 30, 2019, 04:22:17 AM »
I don't think SSE will be advantageous, what we need is search using Binary Search. It will get there in a maximum of 9 iterations.

Code: [Select]
H:\temp>binsearch
Searching for 937.5896481215857 (position 237)
Found 937.5896481215857 in position 237 in 7 iterations

H:\temp>binsearch
Searching for 477.70622882778406 (position 114)
Found 477.70622882778406 in position 114 in 8 iterations

H:\temp>binsearch
Searching for 945.06668294320502 (position 246)
Found 945.06668294320502 in position 246 in 8 iterations

H:\temp>binsearch
Searching for 532.82265694143496 (position 122)
Found 532.82265694143496 in position 122 in 8 iterations

H:\temp>binsearch
Searching for 995.36118655964844 (position 255)
Found 995.36118655964844 in position 255 in 9 iterations

H:\temp>binsearch
Searching for 552.75124362926124 (position 131)
Found 552.75124362926124 in position 131 in 6 iterations

H:\temp>binsearch
Searching for 526.77999206518757 (position 140)
Found 526.77999206518757 in position 140 in 8 iterations

H:\temp>binsearch
Searching for 98.544267097994933 (position 16)
Found 98.544267097994933 in position 16 in 8 iterations

H:\temp>binsearch
Searching for 617.23685415204318 (position 149)
Found 617.23685415204318 in position 149 in 7 iterations

H:\temp>binsearch
Searching for 205.78630939664907 (position 52)
Found 205.78630939664907 in position 52 in 8 iterations

H:\temp>binsearch
Searching for 672.10913418988616 (position 184)
Found 672.10913418988616 in position 184 in 8 iterations

H:\temp>binsearch
Searching for 250.09918515579699 (position 60)
Found 250.09918515579699 in position 60 in 8 iterations

H:\temp>binsearch
Searching for 754.02081362346257 (position 193)
Found 754.02081362346257 in position 193 in 7 iterations

H:\temp>binsearch
Searching for 277.16910306100652 (position 69)
Found 277.16910306100652 in position 69 in 7 iterations

H:\temp>binsearch
Searching for 787.62169255653555 (position 202)
Found 787.62169255653555 in position 202 in 8 iterations

H:\temp>binsearch
Searching for 320.99368266853844 (position 78)
Found 320.99368266853844 in position 78 in 8 iterations

H:\temp>binsearch
Searching for 846.88863795892212 (position 210)
Found 846.88863795892212 in position 210 in 8 iterations

H:\temp>binsearch
Searching for 350.0167851802118 (position 87)
Found 350.0167851802118 in position 87 in 5 iterations

The attachment contains source in C and .exe . It is easy to convert to ASM.

guga

  • Member
  • *****
  • Posts: 1043
  • Assembly is a state of art.
    • RosAsm
Re: Fast Compare Real8 with SSE
« Reply #8 on: January 30, 2019, 04:26:30 AM »
Many Tks, Steve and JJ :)

Steve, i gave a try on comisd and comis but, it seems that they are slower then the regular fcomip accordlying to https://stackoverflow.com/questions/37766131/intel-x86-64-assembly-compare-signed-double-precision-floats.
So i did a variation of the macro i use for Fpu Comparitions, and this worked but, i´m not sure yet about speed yet :icon_rolleyes:

Code: [Select]
(...)
    lea ecx D@TmpRedDis | fld R@pXDis | fmul R$esi+WS_Matrix.Inverted.Red_M1Dis | fld R@pYDis | fmul R$esi+WS_Matrix.Inverted.Green_M2Dis | faddp ST1 ST0 | fld R@pZDis | fmul R$esi+WS_Matrix.Inverted.Blue_M3Dis | faddp ST1 ST0 | fstp R$ecx
    lea ecx D@TmpGreenDis | fld R@pXDis | fmul R$esi+WS_Matrix.Inverted.Red_M4Dis | fld R@pYDis | fmul R$esi+WS_Matrix.Inverted.Green_M5Dis | faddp ST1 ST0 | fld R@pZDis | fmul R$esi+WS_Matrix.Inverted.Blue_M6Dis | faddp ST1 ST0 | fstp R$ecx
    lea ecx D@TmpBlueDis | fld R@pXDis | fmul R$esi+WS_Matrix.Inverted.Red_M7Dis | fld R@pYDis | fmul R$esi+WS_Matrix.Inverted.Green_M8Dis | faddp ST1 ST0 | fld R@pZDis | fmul R$esi+WS_Matrix.Inverted.Blue_M9Dis | faddp ST1 ST0 | fstp R$ecx

    ; After computing the values to be found, let´s scan on our look up table

    lea ebx D$esi+WS_Matrix.KFactorMapDis ; Points to the look up table
    mov edi D@Red ; Output result in "Red" Variable
    fld R@TmpRedDis | fmul R$Float100 | fstp R@TmpRedDis

    fld R@TmpRedDis
    xor ecx ecx
    Do
        fld R$ebx+ecx*8 | fcomip ST0 ST1 | jna L0>
            ffree ST0
            dec ecx | jmp L1>
        L0:
        inc ecx
    Loop_Until ecx > 255
    dec ecx
L1:
    mov D$edi ecx


JJ, i can´t work with integer, unfortunatelly. Not on this part of the code. What returns in "R@TmpRedDis" is, in fact, the result of a computation of the red pixel, so it is a fraction from 0 to 100. Each value on the table i´m using contains the result of "TmpRedDis" whose value is computed on a power basics. So, TempRed = XXX*YYY +1/(z^C) etc etc. The good thing is that the table contains 256 Real8 which values are ordered from 0 to 100 (in real8)...so, 0.04456456, 0.689798, 1.23258., 1.989, ....100 Which makes the searching  faster to scan, i presume.

I´m trying to replace the function to findd the proper value of TempRed with a simple scan to it´s look up table rather then using the regular function to compute it  directly.

The direct function uses a power of 2.4 to retrieve the correct (final) value of RedPixel (originated from TmpRedDis).

The "direct" function is like this:

Code: [Select]
(...)

    lea ecx D@TmpRedDis | fld R@pXDis | fmul R$esi+WS_Matrix.Inverted.Red_M1Dis | fld R@pYDis | fmul R$esi+WS_Matrix.Inverted.Green_M2Dis | faddp ST1 ST0 | fld R@pZDis | fmul R$esi+WS_Matrix.Inverted.Blue_M3Dis | faddp ST1 ST0 | fstp R$ecx
    lea ecx D@TmpGreenDis | fld R@pXDis | fmul R$esi+WS_Matrix.Inverted.Red_M4Dis | fld R@pYDis | fmul R$esi+WS_Matrix.Inverted.Green_M5Dis | faddp ST1 ST0 | fld R@pZDis | fmul R$esi+WS_Matrix.Inverted.Blue_M6Dis | faddp ST1 ST0 | fstp R$ecx
    lea ecx D@TmpBlueDis | fld R@pXDis | fmul R$esi+WS_Matrix.Inverted.Red_M7Dis | fld R@pYDis | fmul R$esi+WS_Matrix.Inverted.Green_M8Dis | faddp ST1 ST0 | fld R@pZDis | fmul R$esi+WS_Matrix.Inverted.Blue_M9Dis | faddp ST1 ST0 | fstp R$ecx

    ; After computing the values to be found, let´s do the computations to find red, Green and Blue using "Power of" maths

    lea ecx D@TmpRedDis
    .Fpu_If R$ecx > R$esi+WS_Matrix.GammaDecode.TresholdDis
        ;Color = ((Color+0.055)/1.055)^2.4
        fld R$esi+WS_Matrix.GammaDecode.GammaDis | fld R$ecx | fyl2x | fld1 | fld ST1 | fprem | f2xm1 | faddp ST1 ST0 | fscale | fxch | fstp ST0
        fmul R$esi+WS_Matrix.GammaDecode.OffsetPlusDis | fsub R$esi+WS_Matrix.GammaDecode.OffsetDis
    .Fpu_Else
         ; Here is faster, but happens only when "index" (Pixel color) is (in general) smaller then 12.12354(or something..this value is not fixed, although it is small). So, the odds of a pixel falls on this jmp routine are 12/256.
        fld R$ecx | fmul R$esi+WS_Matrix.GammaDecode.SlopeDis
    .Fpu_End_If
    mov ecx D@Red | fmul R$Float255 | fistp F$ecx

See ? The function makes usage of logaritm FPU, scaling etc etc...All of this seems to be slower then a loop scan on the LookupTable, though.

Didn´t have time to test for speed yet, but it seems that a direct computation is slower then a simple scan on the  256 Real8 values.
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

guga

  • Member
  • *****
  • Posts: 1043
  • Assembly is a state of art.
    • RosAsm
Re: Fast Compare Real8 with SSE
« Reply #9 on: January 30, 2019, 04:27:49 AM »
Hi Aw, a binary search ? Hmm...indeed..I´l take a look. Many tks :)
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

daydreamer

  • Member
  • ****
  • Posts: 907
  • watch Chebyshev on the backside of the Moon
Re: Fast Compare Real8 with SSE
« Reply #10 on: January 30, 2019, 05:41:26 AM »
I am trying something similar, but enhance it with lookup one double and the following entry in the data and calculate mean value of two,or more advanced calculation based also on which number it is closer affect it more than the one that its farther to
 
Quote from Flashdance
Nick  :  When you give up your dream, you die
*wears a flameproof asbestos suit*
Gone serverside programming p:  :D

jj2007

  • Member
  • *****
  • Posts: 9687
  • Assembler is fun ;-)
    • MasmBasic
Re: Fast Compare Real8 with SSE
« Reply #11 on: January 30, 2019, 06:44:28 AM »
OK, now I see it's an entirely different story. It could be done with a simple cmp eax, [esi+8*ecx] - assuming that the low DWORD is unique in that table of 256.

guga

  • Member
  • *****
  • Posts: 1043
  • Assembly is a state of art.
    • RosAsm
Re: Fast Compare Real8 with SSE
« Reply #12 on: January 30, 2019, 12:20:26 PM »
Hi Jochen.

I´m not sure if the Low Dword values on the table are unique. I´ll take a look at them :t :t :t. But...even if they are unique, how a look up table containing only the low dwords  will work better ?

The main routine to calculate those values used in the KFactorMap are computed way before the main converter (RGBtoCieLCH and CieLCHtoRGB) starts. The values are calculated from the gamma and offset  related to each monitor/camera/colorspace etc (Adobe1998, Bruce, SMTP etc etc)

Code: [Select]
    ; and create our KFactor Map for each color. Put the resultant values on the main Structure to form the table

    xor ecx ecx
    Do
        call FindKFactor ecx, D@pOutStruct
        inc ecx
    Loop_Until ecx > 255

And the main routine is at:
Code: [Select]

; RGBtoCieLCH_Ex
Proc FindKFactor:
    Arguments @Color, @pMatrix
    Structure @TempStorage 64, @pKfactorColorDis 0, @pKFactorWhiteRefXDis 8, @pKFactorWhiteRefYDis 16, @pKFactorWhiteRefZDis 24, @TmpGreenDis 32, @TmpBlueDis 40, @pAFactorDis 48, @pBFactorDis 56
    Uses esi, ebx, ecx, edx, edi

    finit

    lea ecx D@pKfactorColorDis | fild F@Color | fmul R$FloatOne_255 | fstp R$ecx

    mov esi D@pMatrix

    ; Points to the Start of KfactorMap array
    mov ebx esi | add ebx WS_Matrix.KFactorMapDis ; Points to the start of the KFactor Map array in WS_Matrix Structure
    xor edx edx
    mov eax 8;Size_Of_KFactorMap
    mul D@Color
    add ebx eax


    ; newColorRed = ((R@Red+0.055)/1.055)^2.4 ; where2.4 = gamma, 0.055 = Offset, 1.055 = Offset+1. Better precalculating them as i did.
    lea ecx D@pKfactorColorDis
    .Fpu_If R$ecx > R$esi+WS_Matrix.GammaEncode.TresholdDis
        fld R$esi+WS_Matrix.GammaEncode.GammaDis | fld R$esi+WS_Matrix.GammaEncode.OffsetDis | fadd R$ecx | fmul R$esi+WS_Matrix.GammaEncode.OffsetPlusDis | fyl2x | fld1 | fld ST1 | fprem | f2xm1 | faddp ST1 ST0 | fscale | fxch | fstp ST0
    .Fpu_Else
        fld R$ecx | fdiv R$esi+WS_Matrix.GammaEncode.SlopeDis
    .Fpu_End_If
    fmul R$Float100 | fstp R$ebx;+KFactorMap.ColorDis ; Original i wrote it as KFactorRed, KfactorGreen, KfactorBlue, but, in fact it is only 1 color/pixel to check. So... 256 in total

EndP


These functions are giving me a tremendous headache. I just found out that the threshold must be analysed (On the main convertiopn function CieLCHtoRGB and not the above one;) ) . When i convert colors such as Red = 12, Green = 14, Blue = 0, the resultant vaules after tyhe multiplication of the matrices are below the "WS_Matrix.GammaDecode.Treshold". So i´ll need to try to find a integer value and put that liek another members of the structure i´m using to create the table. Once i found if the threshold can be converted to a integer, it would be eaiser to make the check doing the convertion back after the matrix multiplication.

Something like:

Code: [Select]
mov eax D$esi+ThresholdInteger
fld R@TmpRedDis | fmul R$esi+WS_Matrix.GammaDecode.SlopeDis | fmul R$Float255 | fistp D@TempColorCheck
If D@TempColorCheck < eax
      return D@TempColorCheck ; the color of this pixel was found under the treshold.
Else
     ; perform check on the loop up table (Binary search, or whatever other method)
EndIf


I hope i can find a way to create a integer related to this threshold. The only problem i see is that we will need extra coding (mul by slope and mul by 255) to pixels that are found in less then 13% of the cases.

A true pain make those functions works correctly and to get things worst, all the papers i read so far are utterly complicated and tends to put different equations to create the very same thing, rather then simply trying to merge those equations together and then try to simplify all of this.

At least i´ve got the function works (well..kind of :icon_mrgreen: :icon_mrgreen:) and the only thing that needs to be done is optimize all of this.
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

AW

  • Member
  • *****
  • Posts: 2318
  • Let's Make ASM Great Again!
Re: Fast Compare Real8 with SSE
« Reply #13 on: January 30, 2019, 08:31:57 PM »
All right, finally a MASM solution in this thread (which is derived from my initial C solution  :lol:) adjusting for the requirement that the search value doesn't need to be exact.

So, I adjusted the original C algo like this:

Code: [Select]
int binarySearch(double sortedArray[], int first, int last, double key, int *iterations)
{

while (first <= last) {
int mid = first + (last-first) / 2;
(*iterations)++;
if (key == sortedArray[mid])
return mid;
else if (key < sortedArray[mid])
last = mid - 1;
else {
if (key < sortedArray[mid + 1])
return mid;
else
first = mid + 1;

}
}
return -1;
}

and can be translated into an ASM module like this, after deoptimizing as needed the compiler ASM output:
Code: [Select]
.model flat, C
option casemap :none

.code

binSearch proc uses esi edi ebx sortedArray:ptr, first:dword, last:dword, _key:real8, iteration:ptr
fld QWORD PTR _key
mov edi, last
mov esi, first
mov ebx, iteration

@startLoop:
inc dword ptr[ebx]

; if (key == sortedArray[mid])
fld ST(0) ;  duplicate stack
mov eax, edi
sub eax, esi
cdq
sub eax, edx
mov ecx, eax
sar ecx, 1
add ecx, esi
mov edx, sortedArray
fld QWORD PTR [edx+ecx*8]
fucompp
fnstsw ax
test ah, 68 ; 00000044H
jnp @exit
fcom QWORD PTR [edx+ecx*8]
fnstsw ax
test ah, 5
jp SHORT @checkElse

lea edi, DWORD PTR [ecx-1]
jmp SHORT @endLoop

@checkElse:
fcom QWORD PTR [edx+ecx*8+8]
fnstsw ax
test ah, 5
jnp SHORT @exit
lea esi, DWORD PTR [ecx+1]

@endLoop:
cmp esi, edi
jle SHORT @startLoop

@exit:
fstp ST(0)
mov eax, ecx

ret
binSearch endp

end

So, lets test 100 times:

Code: [Select]
Searching for 393.98684652241582 (position 104)
Found 392.98684652241582 in position 104 in 8 iterations
Searching for 1.488296151615955 (position 0)
Found 0.48829615161595508 in position 0 in 8 iterations
Searching for 343.86123233741267 (position 91)
Found 340.86123233741267 in position 91 in 6 iterations
Searching for 866.07766960661638 (position 218)
Found 865.07766960661638 in position 219 in 6 iterations
Searching for 768.01458784752947 (position 198)
Found 766.01458784752947 in position 198 in 8 iterations
Searching for 457.86919766838588 (position 125)
Found 457.86919766838588 in position 125 in 7 iterations
Searching for 388.74477370525221 (position 102)
Found 388.74477370525221 in position 102 in 8 iterations
Searching for 180.82216864528337 (position 51)
Found 180.82216864528337 in position 51 in 6 iterations
Searching for 427.22861415448472 (position 114)
Found 427.22861415448472 in position 114 in 8 iterations
Searching for 517.38947721793272 (position 137)
Found 514.38947721793272 in position 137 in 7 iterations
Searching for 480.00430310983609 (position 127)
Found 477.00430310983609 in position 129 in 7 iterations
Searching for 326.21442304757835 (position 82)
Found 322.21442304757835 in position 83 in 6 iterations
Searching for 645.95684682760088 (position 162)
Found 641.95684682760088 in position 164 in 8 iterations
Searching for 950.56920072023684 (position 239)
Found 947.56920072023684 in position 239 in 4 iterations
Searching for 825.06790368358406 (position 212)
Found 825.06790368358406 in position 212 in 8 iterations
Searching for 801.96746726889853 (position 206)
Found 797.96746726889853 in position 208 in 8 iterations
Searching for 598.49137852107299 (position 154)
Found 597.49137852107299 in position 155 in 6 iterations
Searching for 182.50276192510759 (position 49)
Found 178.50276192510759 in position 52 in 8 iterations
Searching for 910.81780449842836 (position 229)
Found 909.81780449842836 in position 229 in 7 iterations
Searching for 178.50276192510759 (position 49)
Found 178.50276192510759 in position 49 in 7 iterations
Searching for 193.48023926511428 (position 54)
Found 192.48023926511428 in position 54 in 8 iterations
Searching for 10.19525742362743 (position 2)
Found 6.1952574236274298 in position 3 in 6 iterations
Searching for 27.130954924161504 (position 7)
Found 27.130954924161504 in position 7 in 5 iterations
Searching for 441.32816553239542 (position 120)
Found 441.32816553239542 in position 120 in 8 iterations
Searching for 394.19336527603991 (position 103)
Found 392.19336527603991 in position 104 in 8 iterations
Searching for 791.80480361339153 (position 202)
Found 787.80480361339153 in position 203 in 6 iterations
Searching for 911.36002685628841 (position 228)
Found 909.36002685628841 in position 229 in 7 iterations
Searching for 581.5159764397107 (position 151)
Found 579.5159764397107 in position 151 in 5 iterations
Searching for 660.84780419324318 (position 173)
Found 660.84780419324318 in position 173 in 7 iterations
Searching for 145.35254982146674 (position 38)
Found 144.35254982146674 in position 38 in 8 iterations
Searching for 969.23734244819479 (position 248)
Found 969.23734244819479 in position 248 in 8 iterations
Searching for 961.31171605578777 (position 242)
Found 958.31171605578777 in position 243 in 6 iterations
Searching for 642.95684682760088 (position 162)
Found 641.95684682760088 in position 162 in 8 iterations
Searching for 392.74477370525221 (position 102)
Found 388.74477370525221 in position 103 in 5 iterations
Searching for 93.426343577379683 (position 21)
Found 90.426343577379683 in position 22 in 8 iterations
Searching for 909.81780449842836 (position 229)
Found 909.81780449842836 in position 229 in 7 iterations
Searching for 421.13409833063753 (position 111)
Found 418.13409833063753 in position 111 in 4 iterations
Searching for 54.386059144871361 (position 13)
Found 50.386059144871361 in position 13 in 7 iterations
Searching for 899.67143772698148 (position 225)
Found 897.67143772698148 in position 225 in 7 iterations
Searching for 222.16763817255165 (position 61)
Found 221.16763817255165 in position 61 in 7 iterations
Searching for 331.21738334299755 (position 84)
Found 331.21738334299755 in position 84 in 8 iterations
Searching for 335.21738334299755 (position 84)
Found 331.21738334299755 in position 87 in 5 iterations
Searching for 324.61827448347424 (position 83)
Found 323.61827448347424 in position 83 in 6 iterations
Searching for 716.20126956999422 (position 188)
Found 715.20126956999422 in position 190 in 8 iterations
Searching for 144.33121738334302 (position 37)
Found 141.33121738334302 in position 37 in 7 iterations
Searching for 660.14795373393963 (position 168)
Found 656.14795373393963 in position 171 in 6 iterations
Searching for 208.93890194402906 (position 57)
Found 205.93890194402906 in position 57 in 7 iterations
Searching for 593.89019440290531 (position 153)
Found 593.89019440290531 in position 153 in 7 iterations
Searching for 787.87502670369577 (position 200)
Found 784.87502670369577 in position 202 in 8 iterations
Searching for 147.24469740897854 (position 39)
Found 146.24469740897854 in position 39 in 5 iterations
Searching for 678.17520676290167 (position 181)
Found 677.17520676290167 in position 181 in 7 iterations
Searching for 998.74163029877616 (position 254)
Found 997.74163029877616 in position 255 in 9 iterations
Searching for 699.79149754325999 (position 184)
Found 695.79149754325999 in position 184 in 8 iterations
Searching for 150.24469740897854 (position 39)
Found 146.24469740897854 in position 40 in 8 iterations
Searching for 714.77401043733028 (position 187)
Found 714.77401043733028 in position 187 in 6 iterations
Searching for 516.38947721793272 (position 137)
Found 514.38947721793272 in position 137 in 7 iterations
Searching for 497.91866817224644 (position 132)
Found 494.91866817224644 in position 132 in 8 iterations
Searching for 663.00750755333104 (position 175)
Found 662.00750755333104 in position 175 in 4 iterations
Searching for 578.77642139957891 (position 149)
Found 577.77642139957891 in position 149 in 7 iterations
Searching for 793.84954374828328 (position 203)
Found 789.84954374828328 in position 204 in 8 iterations
Searching for 323.71190527054659 (position 80)
Found 319.71190527054659 in position 83 in 6 iterations
Searching for 979.27082125308993 (position 251)
Found 978.27082125308993 in position 251 in 6 iterations
Searching for 360.4256721701712 (position 95)
Found 356.4256721701712 in position 95 in 3 iterations
Searching for 791.95532090212714 (position 204)
Found 791.95532090212714 in position 204 in 8 iterations
Searching for 325.61827448347424 (position 83)
Found 323.61827448347424 in position 83 in 6 iterations
Searching for 997.99496444593649 (position 253)
Found 994.99496444593649 in position 254 in 8 iterations
Searching for 540.79717398602259 (position 139)
Found 537.79717398602259 in position 141 in 7 iterations
Searching for 501.13733329264198 (position 133)
Found 500.13733329264198 in position 134 in 8 iterations
Searching for 340.86330759605698 (position 88)
Found 336.86330759605698 in position 91 in 6 iterations
Searching for 839.77782525101475 (position 214)
Found 839.77782525101475 in position 214 in 8 iterations
Searching for 349.44611957152011 (position 94)
Found 346.44611957152011 in position 94 in 8 iterations
Searching for 383.97677541428874 (position 100)
Found 382.97677541428874 in position 100 in 8 iterations
Searching for 230.68230231635485 (position 62)
Found 229.68230231635485 in position 63 in 2 iterations
Searching for 362.13263344218268 (position 97)
Found 362.13263344218268 in position 97 in 7 iterations
Searching for 429.22861415448472 (position 114)
Found 427.22861415448472 in position 115 in 6 iterations
Searching for 791.84954374828328 (position 203)
Found 789.84954374828328 in position 203 in 6 iterations
Searching for 443.32816553239542 (position 120)
Found 441.32816553239542 in position 121 in 7 iterations
Searching for 661.09402752769552 (position 169)
Found 657.09402752769552 in position 173 in 7 iterations
Searching for 182.71639149143957 (position 50)
Found 178.71639149143957 in position 52 in 8 iterations
Searching for 563.93731498153636 (position 146)
Found 561.93731498153636 in position 146 in 8 iterations
Searching for 540.29969176305428 (position 141)
Found 540.29969176305428 in position 141 in 7 iterations
Searching for 420.13409833063753 (position 110)
Found 418.13409833063753 in position 111 in 4 iterations
Searching for 366.13263344218268 (position 97)
Found 362.13263344218268 in position 97 in 7 iterations
Searching for 483.0490432447279 (position 129)
Found 479.0490432447279 in position 130 in 8 iterations
Searching for 587.92486342967004 (position 152)
Found 585.92486342967004 in position 152 in 8 iterations
Searching for 649.0768456068605 (position 165)
Found 646.0768456068605 in position 165 in 7 iterations
Searching for 568.59953611865603 (position 147)
Found 565.59953611865603 in position 147 in 6 iterations
Searching for 842.77782525101475 (position 214)
Found 839.77782525101475 in position 214 in 8 iterations
Searching for 788.99710074159975 (position 201)
Found 784.99710074159975 in position 202 in 8 iterations
Searching for 970.03289895321507 (position 244)
Found 966.03289895321507 in position 249 in 7 iterations
Searching for 669.89046906949068 (position 177)
Found 666.89046906949068 in position 178 in 8 iterations
Searching for 764.82436597796561 (position 197)
Found 764.82436597796561 in position 197 in 7 iterations
Searching for 910.94906460768459 (position 227)
Found 906.94906460768459 in position 229 in 7 iterations
Searching for 216.10364085818048 (position 58)
Found 212.10364085818048 in position 60 in 8 iterations
Searching for 446.80727561265911 (position 122)
Found 444.80727561265911 in position 123 in 6 iterations
Searching for 964.17334513382366 (position 243)
Found 960.17334513382366 in position 243 in 6 iterations
Searching for 717.20126956999422 (position 188)
Found 715.20126956999422 in position 190 in 8 iterations
Searching for 11.169774468214973 (position 4)
Found 11.169774468214973 in position 4 in 8 iterations
Searching for 995.99496444593649 (position 253)
Found 994.99496444593649 in position 253 in 7 iterations
Searching for 976.47230445265052 (position 250)
Found 972.47230445265052 in position 250 in 8 iterations

I don't believe that a linear search outperforms a binary search even with only 256 elements, anyway that is a lot of fun too.  :biggrin:

guga

  • Member
  • *****
  • Posts: 1043
  • Assembly is a state of art.
    • RosAsm
Re: Fast Compare Real8 with SSE
« Reply #14 on: January 30, 2019, 10:11:36 PM »
Many Tks AW  :t :t :t. I´ll take a look :)
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com