Author Topic: Fast Compare Real8 with SSE and ColorSpaces  (Read 6898 times)

jj2007

  • Member
  • *****
  • Posts: 9686
  • Assembler is fun ;-)
    • MasmBasic
Re: Fast Compare Real8 with SSE
« Reply #15 on: January 30, 2019, 10:30:24 PM »
given a table of 256 Real8 values from 0 to 100 (And all in crescent order from 0 to 100 only)

I've given it a try for the naive scan method:
Code: [Select]
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz
10000000 random searches performed in 1157 ms

The source:

include \masm32\MasmBasic\MasmBasic.inc         ; download
  Init
  Dim table() As REAL8          ; create an array
  Let edi=New$(256*DWORD)       ; and a DWORD table
  push edi                      ; we need it later
  Rand()
  For_ ecx=0 To 255
        Rand(0, 100, table(ecx))
  Next
  ArraySort table()             ; sort double array ascending
  Open "O", #1, "TheTable.dat"  ; save the array to disk
  ArrayStore #1, table()
  Close
  For_ ecx=0 To 255
        movlps xmm0, table(ecx) ; get a value from the table
        .if ecx<5 || ecx>=255-4
                .if Zero?
                                PrintLine "..."
                .endif
                Print Str$(ecx), Str$("\t%9f\n", f:xmm0)        ; print the first and last values
        .endif
        movd eax, xmm0
        stosd                   ; store the double's low DWORD
  Next
  pop edi
  loops=10000000
  push loops-1
  PrintCpu 0
  NanoTimer()
  .Repeat
        mov ecx, 256
        mov ebx, Rand(ecx)      ; test a random search position
        mov eax, DWORD PTR table(ebx)
        push edi
        push ecx
        repnz scasd             ; this is the search algo ;-)
        pop edx
        sub ecx, edx
        not ecx
        pop edi
        .if ecx!=ebx            ; check if the found position is the expected one
                Print Str$("\n## Mismatch at pos %i", ecx), Str$("<>%i ##\n\n", ebx)
                .Break
        .endif
        dec stack
  .Until Sign?
  Print Str$("%i random searches performed in ", loops), NanoTimer$()
EndOfCode


The binary search could be faster but also trickier. You would have to compare the HIGH dwords, and they are much more likely to have duplicates.

guga

  • Member
  • *****
  • Posts: 1043
  • Assembly is a state of art.
    • RosAsm
Re: Fast Compare Real8 with SSE
« Reply #16 on: January 31, 2019, 06:34:41 PM »
Tks a lot, Jochen and AW.

JJ, I´ll give more tests but it seems that your idea on using the low dword worked. I also came up with a routine biased on your´s and on AW, can you please benchmark this for me and see if it is working ?

Compare Double fpu (the low dword, in fact).
Code: [Select]

Proc FastCompare4_Double:
    Arguments @pTable, @Value, @Size
    Uses esi, ebx, ecx, edi

        mov edi D@Size
        xor eax eax
        shl edi 1 ; always mul by 2 at the start. A Real8 is double the size of a dword.
        mov esi D@pTable
L1:
        ; half = edi
        shr edi 1
        lea ecx D$edi+eax | mov ebx D$esi+ecx*4+4; Otherlow = low+half

        If D@Value = ebx
            mov eax ecx ; Commented out the inc eax . Inc return the actual Pos (from 1 to XX) and not the index from 0 to XX
            shr eax 1
            ExitP
        End_If
        cmovg eax ecx
       
        test edi edi | jne L1<
EndP

___________________________________

Example of usage:

[TestingTable1: D$ 0, 0
TestingTable2: D$ 0, 1
TestingTable3: D$ 0, 2
TestingTable4: D$ 0, 3
TestingTable5: D$ 0, 4
TestingTable6: D$ 0, 5
TestingTable7: D$ 0, 6
TestingTable8: D$ 0, 7
TestingTable9: D$ 0, 8
....
TestingTable255: D$ 0, 254
TestingTable256: D$ 0, 255]

call FastCompare4_Double TestingTable1, 177, 256

Compare Dwords only

Code: [Select]
Proc FastCompare4:
    Arguments @pTable, @Value, @Size
    Uses esi, ebx, ecx, edi


    mov esi D@pTable
    xor eax eax ; eax = low
     mov edi D@Size
L1:
        ; half = edi
        shr edi 1
        lea ecx D$edi+eax | mov ebx D$esi+ecx*4; Otherlow = low+half

        If D@Value = ebx
            mov eax ecx; Commented out the inc eax . Inc return the actual Pos (from 1 to XX) and not the index from 0 to XX
            ExitP
        End_If
        cmovg eax ecx

        test edi edi | jne L1<

EndP

_______________________________________

Example of usage:

[MyGugaTestTable1: D$ 0
MyGugaTestTable2: D$ 1
MyGugaTestTable3: D$ 2
MyGugaTestTable4: D$ 3
MyGugaTestTable5: D$ 4
MyGugaTestTable6: D$ 5
MyGugaTestTable7: D$ 6
MyGugaTestTable8: D$ 7
MyGugaTestTable9: D$ 8
(...)
MyGugaTestTable254: D$ 253
MyGugaTestTable255: D$ 254
MyGugaTestTable256: D$ 255]

call FastCompare4 MyGugaTestTable1, 177, 256


I hope that those functions are faster then the logarithm computation used on the original function :)
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

jj2007

  • Member
  • *****
  • Posts: 9686
  • Assembler is fun ;-)
    • MasmBasic
Re: Fast Compare Real8 with SSE
« Reply #17 on: January 31, 2019, 06:47:04 PM »
can you please benchmark this

I'm afraid my assembler won't accept the syntax... do you have a translator?

Btw that table of REAL8s - is that always the same numbers, or are they randomly created? If they are the same, testing once if there are duplicates would be sufficient. If not, there is a hypothetical chance to find a duplicate, and that would require one extra run.

guga

  • Member
  • *****
  • Posts: 1043
  • Assembly is a state of art.
    • RosAsm
Re: Fast Compare Real8 with SSE
« Reply #18 on: January 31, 2019, 11:19:59 PM »
Hi JJ. Tks :)


I believe the syntax is like this

Code: [Select]

pTable = Pointer to the table of 256 Real8
Value = The value to be found. Any Integer
Size = Integer. Size of the table. Dummy, i know...since we are testing only 256 Real8, but i did like that to remember later of what i was doing :)

FastCompare4_Double proc pTable: Dword, Value:Dword, Size:Dword

                push    ebp
               mov     ebp, esp
                push    esi
                push    ebx
                push    ecx
                push    edi
                mov     edi, Size ; The size of a Real8 is always 2x the size of an integer. So, we need to double it´s size here so we can compare the value with only it´s low dword later on mov ebx [esi+ecx*4+4]
                xor     eax, eax
                shl     edi, 1
                mov     esi, pTable ; or.. offset pTable (I forgot the masm syntax. But here is a pointer not the value itself.

loc_45AEA1:
                shr     edi, 1
                lea     ecx, [eax+edi]
                mov     ebx, [esi+ecx*4+4]
                cmp     Value, ebx ; The valuye to be found is compared to the Low Dword from our Real8 table
                jnz     short loc_45AEB8
                mov     eax, ecx
                shr     eax, 1 ; Note: We are retrieving only the index and not the actual pos. For Real8 computations, divide the result by 2 to it retrieve the proper index. To retrieve the actual pos simply increase this result by 1
                jmp     loc_45AEC4 ; Since we found the value we can exit the function :)

loc_45AEB8:
                cmovg   eax, ecx
                test    edi, edi
                jnz     short loc_45AEA1
                mov     eax, 0FFFFFFFFh ; Nothing was found. Return -1

loc_45AEC4:
                pop     edi
                pop     ecx
                pop     ebx
                pop     esi
                mov     esp, ebp
                pop     ebp
                retn    0Ch
FastCompare4_Double endp

; ---------------------------------------------------------------------------



About the Real8 values...they are all the same. They are created from a computation of gamma and offset, whoose values are fixed for every color model, with the exception of only few models (sRGB HDTV and Adobe RGB 1998) that uses these, respectivelly: Gamma: 2.19921875, Offset: 0.055 and Gamma: 2.2, Offset: 0.099). All other models uses the same values for gamma and offset (2.2 for gammaa and 0.055 for offset. Other few models uses values as 1.8 for gamma).

The resultant values of Real8 (i named them as "Kfactor") on the table are given by the following equation:

ColorNormnalized = Color/255 ; The table of 256 Real8 is created using a range of integers from Color = 0 to 255. I mean, i created a function and insert integer values (not normalized) from 0 to 255 to it compute the Kfactor values for each one of the 256 colors.

KFactor = [ColorNormalized+Offset)/(1+Offset))^Gamma]  * 100 ; If ColorNormalized (Color/255) is bigger then Treshold
KFactor = (ColorNormalized/Slope)  * 100 ; If ColorNormalized (Color/255) is smaller or equal to Treshold

Threshold  and slope are calculated from Gamma and offset like this:

Threshold  = Offset/(Gamma-1)
Slope = [Offset * ( (Gamma-1)^(Gamma-1) )] * [(Offset+1)/(Offset*Gamma)]^Gamma

The only fixed values are, offset and gamma that the user must insert to create threshold and slope for the KFactor computation.
« Last Edit: February 01, 2019, 02:08:23 AM by guga »
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

guga

  • Member
  • *****
  • Posts: 1043
  • Assembly is a state of art.
    • RosAsm
Re: Fast Compare Real8 with SSE
« Reply #19 on: February 01, 2019, 01:02:41 AM »
Note:

The values of gamma varies from 1.1 to 2.4
Offset is a tiny value. 0.055 (or 0.099 for HDTV only). The values seems to be x/1000 where x >0 < 100, but i´m not sure how they are achieved. I stablished the limits for gamma to avoid division by zero. But for offset i´ll take a look later. In cases where offset are 0 the correct would result in a slope of 1, if i assume that 0/0 = 1. But i don´t remember if for gamma and offset it do exists a formula to retrieve those values or they are only made by the user/manufacturer of a monitor for example.

The formulas can be found here:
https://en.wikipedia.org/wiki/SRGB
http://www.marcelpatek.com/color.html
http://hem.bredband.net/egostudios/gamma_management.htm
http://www.marcelpatek.com/gamma.html
https://web.archive.org/web/20130718042406/http://www.arcsynthesis.org/gltut/Illumination/Tut12%20Monitors%20and%20Gamma.html

Marcel Patek´s  formulas seems incorrect as mentioned earlier.
« Last Edit: February 01, 2019, 02:05:58 AM by guga »
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

jj2007

  • Member
  • *****
  • Posts: 9686
  • Assembler is fun ;-)
    • MasmBasic
Re: Fast Compare Real8 with SSE
« Reply #20 on: February 01, 2019, 03:03:44 AM »
Hi Guga,

I've tried my luck but your algo returns always -1 (and you haven't been generous with comments). This is supposed to be binary search, right? What's wrong? Does it work on your end?

The parameters are passed correctly:
Code: [Select]
mov edi, _Size ; 100h, ok
xor eax, eax
mov esi, pTable ; OK
; ^ ^ at this point: edi=100h, esi points to REAL8 values ^ ^

Code: [Select]
option prologue:none
option epilogue:none
if 0
  pTable = Pointer to the table of 256 Real8
  Value = The value to be found. Any Integer
  Size = Integer. Size of the table. Dummy, i know...since we are testing only 256 Real8, but i did like that to remember later of what i was doing :)
endif
;   invoke FastCompare4_Double, addr table(0), DWORD PTR table(ebx), ecx
FastCompare4_Double proc pTable: Dword, Value:Dword, _Size:Dword
push ebp
mov ebp, esp
push esi
push ebx
push ecx
push edi
mov edi, _Size ; 100h, ok
xor eax, eax
shl edi, 1
mov esi, pTable ; OK or.. offset pTable (I forgot the masm syntax. But here is a pointer not the value itself.
loc_45AEA1:
shr edi, 1
lea ecx, [eax+edi]
mov ebx, [esi+ecx*4+4]
cmp Value, ebx ; or.. offset Value (I forgot the masm syntax. But here is a pointer not the value itself.
jnz short loc_45AEB8
mov eax, ecx
shr eax, 1
jmp loc_45AEC4

loc_45AEB8:
cmovg eax, ecx
test edi, edi
jnz short loc_45AEA1
mov eax, 0FFFFFFFFh

loc_45AEC4:
pop edi
pop ecx
pop ebx
pop esi
mov esp, ebp
pop ebp
retn 0Ch
FastCompare4_Double endp
OPTION PROLOGUE:PrologueDef
OPTION PROLOGUE:PrologueDef

guga

  • Member
  • *****
  • Posts: 1043
  • Assembly is a state of art.
    • RosAsm
Re: Fast Compare Real8 with SSE
« Reply #21 on: February 01, 2019, 04:35:55 AM »
Oops..i forgot to reinsert the comments. Sorry  :redface: :redface: :redface: :redface:I removed them after optimizing. Here is the non optimized version with comment.

Here both are working, but only when it finds a match of the exact value. When it tries to find a value in between 2 it returns also -1

For instance, when the table is
[0, 5, 9, 12]
If the value to be found is 7, it will return -1.

I forgot to adapt the routine on the cases for the next value be bigger. Ex: 7 is in between 5 and 9. So it must return index 1

Here is the non optimized masm version. (For the exact match, i mean, and not the other to searching for values in between (i´m working on it, right now and see if i can be able to use the cmovge/cmovle as well to optimize:) )

Code: [Select]
FastCompare3_Double proc near
Probe           = dword ptr -1Ch
mid             = dword ptr -14h
Low             = dword ptr -10h
half            = dword ptr -8
CurSize         = dword ptr -4
pTable          = dword ptr  8
MyValue         = dword ptr  0Ch
Size            = dword ptr  10h

                push    ebp
                mov     ebp, esp
                sub     esp, 1Ch
                push    esi
                push    ecx
                push    edx
                mov     [ebp+Low], 0 ; Start LowVariable = 0
                mov     eax, [ebp+Size]
                shl     eax, 1
                mov     [ebp+CurSize], eax ; The Size always is mul by 2 because Real8 is 2x bigger then a Dword. So, current size starts with 512  (256*2)
                mov     esi, [ebp+pTable]  ; Pointer to our table of real8

loc_45AF7E:                             ; CODE XREF: FastCompare3_Double+6A↓j
                cmp     [ebp+CurSize], 0  ; On each loop the current size is being decreased. Size is used as a counter.
                jbe     loc_45AFCF
                mov     ecx, [ebp+CurSize]
                shr     ecx, 1
                mov     [ebp+half], ecx  ; half = cursize/2. half means searching for half of the table
                mov     ecx, [ebp+Low]
                add     ecx, [ebp+half]
                mov     [ebp+mid], ecx ; Mid is what will return. Here it is Mid = Low+CurSize/2
                mov     eax, [esi+ecx*4+4] ; esi will point to the middle of the table related to the half of the current position scnanned
                mov     [ebp+Probe], eax   ; The value from the table to check
                mov     eax, [ebp+MyValue] ; The inputed value
                cmp     eax, [ebp+Probe]
                jnz     short loc_45AFB4
                mov     eax, [ebp+mid] ; If MyValue is equal to the one being checked, we found the match.
                shr     eax, 1         ; since we are working with Real8, the original size was doubled. So we need to divide back by 2 to get the proper index. It will return the index. To return the pos, simply add 1 after the shr eax...
                jmp     loc_45AFCF     ; Exit our loop
; ---------------------------------------------------------------------------
;                jmp     short loc_45AFC1 ; Left over by RosAsm Else_If macro. Can be commented out
; ---------------------------------------------------------------------------

loc_45AFB4:                             ; CODE XREF: FastCompare3_Double+46↑j
                cmp     eax, [ebp+Probe] ; If the value from the table is bigger then My Value, it means we advanced the position. We lost he spot in forwards. Ex: MyValue = 5. Table ..... 2, 3, 5, 6 (6 = Probe). So Probe is forward the value to be found
                jbe     short loc_45AFBE
                mov     eax, [ebp+mid]    ; If we are forward it, get back to the previous position.
                jmp     short loc_45AFC1
; ---------------------------------------------------------------------------

loc_45AFBE:                             ; CODE XREF: FastCompare3_Double+57↑j
                mov     eax, [ebp+Low]  ; If we are here, it means that MyValue is located before the Probe. So, we are backwards. Ex:  MyValue = 5. Table ..... 2, 3 (Probe), 5, 6 . So Probe is before the value to be found

loc_45AFC1:                             ; CODE XREF: FastCompare3_Double+52↑j
                                        ; FastCompare3_Double+5C↑j
                mov     [ebp+Low], eax   ; Save the new position
                mov     eax, [ebp+half]  ; Current size is always the half of what is beeing scanned. Perform the look if we are not over (So, if Curize is bigger then 0)
                mov     [ebp+CurSize], eax
                jmp     loc_45AF7E
; ---------------------------------------------------------------------------

loc_45AFCF:                             ; CODE XREF: FastCompare3_Double+22↑j
                                        ; FastCompare3_Double+4D↑j
                pop     edx
                pop     ecx
                pop     esi
                mov     esp, ebp
                pop     ebp
                retn    0Ch
FastCompare3_Double endp

To use the function you can do this:

Code: [Select]
Myreal8Table    dd 3 dup(0), 1, 0, 2, 0, 3, 0, 4, 0, 5, 0, 6, 0, 7, 0
dd 8, 0, 9, 0, 0Ah, 0, 0Bh, 0, 0Ch, 0, 0Dh, 0, 0Eh, 0
dd 0Fh, 0, 10h, 0, 11h, 0, 12h, 0, 13h, 0, 14h, 0, 15h
dd 0, 16h, 0, 17h, 0, 18h, 0, 19h, 0, 1Ah, 0, 1Bh, 0, 1Ch
dd 0, 1Dh, 0, 1Eh, 0, 1Fh, 0, 20h, 0, 21h, 0, 22h, 0, 23h
dd 0, 24h, 0, 25h, 0, 26h, 0, 27h, 0, 28h, 0, 29h, 0, 2Ah
dd 0, 2Bh, 0, 2Ch, 0, 2Dh, 0, 2Eh, 0, 2Fh, 0, 30h, 0, 31h
dd 0, 32h, 0, 33h, 0, 34h, 0, 35h, 0, 36h, 0, 37h, 0, 38h
dd 0, 39h, 0, 3Ah, 0, 3Bh, 0, 3Ch, 0, 3Dh, 0, 3Eh, 0, 3Fh
dd 0, 40h, 0, 41h, 0, 42h, 0, 43h, 0, 44h, 0, 45h, 0, 46h
dd 0, 47h, 0, 48h, 0, 49h, 0, 4Ah, 0, 4Bh, 0, 4Ch, 0, 4Dh
dd 0, 4Eh, 0, 4Fh, 0, 50h, 0, 51h, 0, 52h, 0, 53h, 0, 54h
dd 0, 55h, 0, 56h, 0, 57h, 0, 58h, 0, 59h, 0, 5Ah, 0, 5Bh
dd 0, 5Ch, 0, 5Dh, 0, 5Eh, 0, 5Fh, 0, 60h, 0, 61h, 0, 62h
dd 0, 63h, 0, 64h, 0, 65h, 0, 66h, 0, 67h, 0, 68h, 0, 69h
dd 0, 6Ah, 0, 6Bh, 0, 6Ch, 0, 6Dh, 0, 6Eh, 0, 6Fh, 0, 70h
dd 0, 71h, 0, 72h, 0, 73h, 0, 74h, 0, 75h, 0, 76h, 0, 77h
dd 0, 78h, 0, 79h, 0, 7Ah, 0, 7Bh, 0, 7Ch, 0, 7Dh, 0, 7Eh
dd 0, 7Fh, 0, 80h, 0, 81h, 0, 82h, 0, 83h, 0, 84h, 0, 85h
dd 0, 86h, 0, 87h, 0, 88h, 0, 89h, 0, 8Ah, 0, 8Bh, 0, 8Ch
dd 0, 8Dh, 0, 8Eh, 0, 8Fh, 0, 90h, 0, 91h, 0, 92h, 0, 93h
dd 0, 94h, 0, 95h, 0, 96h, 0, 97h, 0, 98h, 0, 99h, 0, 9Ah
dd 0, 9Bh, 0, 9Ch, 0, 9Dh, 0, 9Eh, 0, 9Fh, 0, 0A0h, 0
dd 0A1h, 0, 0A2h, 0, 0A3h, 0, 0A4h, 0, 0A5h, 0, 0A6h, 0
dd 0A7h, 0, 0A8h, 0, 0A9h, 0, 0AAh, 0, 0ABh, 0, 0ACh, 0
dd 0ADh, 0, 0AEh, 0, 0AFh, 0, 0B0h, 0, 0B1h, 0, 0B2h, 0
dd 0B3h, 0, 0B4h, 0, 0B5h, 0, 0B6h, 0, 0B7h, 0, 0B8h, 0
dd 0B9h, 0, 0BAh, 0, 0BBh, 0, 0BCh, 0, 0BDh, 0, 0BEh, 0
dd 0BFh, 0, 0C0h, 0, 0C1h, 0, 0C2h, 0, 0C3h, 0, 0C4h, 0
dd 0C5h, 0, 0C6h, 0, 0C7h, 0, 0C8h, 0, 0C9h, 0, 0CAh, 0
dd 0CBh, 0, 0CCh, 0, 0CDh, 0, 0CEh, 0, 0CFh, 0, 0D0h, 0
dd 0D1h, 0, 0D2h, 0, 0D3h, 0, 0D4h, 0, 0D5h, 0, 0D6h, 0
dd 0D7h, 0, 0D8h, 0, 0D9h, 0, 0DAh, 0, 0DBh, 0, 0DCh, 0
dd 0DDh, 0, 0DEh, 0, 0DFh, 0, 0E0h, 0, 0E1h, 0, 0E2h, 0
dd 0E3h, 0, 0E4h, 0, 0E5h, 0, 0E6h, 0, 0E7h, 0, 0E8h, 0
dd 0E9h, 0, 0EAh, 0, 0EBh, 0, 0ECh, 0, 0EDh, 0, 0EEh, 0
dd 0EFh, 0, 0F0h, 0, 0F1h, 0, 0F2h, 0, 0F3h, 0, 0F4h, 0
dd 0F5h, 0, 0F6h, 0, 0F7h, 0, 0F8h, 0, 0F9h, 0, 0FAh, 0
dd 0FBh, 0, 0FCh, 0, 0FDh, 0, 0FEh, 0, 0FFh


_________________________________________________________________

push    256
push    253
push    offset Myreal8Table
call    FastCompare3_Double


Sorry for the assembly listing. I tried to port to masm, but i forgot the proper syntax, so i had to disassemble the function to you see.

In RosAsm the function looks like:

Code: [Select]
Proc FastCompare3_Double:
    Arguments @pTable, @Value, @Size
    Local @CurSize, @half, @Other_Half, @Low, @mid, @Probe
    Uses esi, ecx, edx

    mov D@Low 0
    mov eax D@Size | shl eax 1 | mov D@CurSize eax
    mov esi D@pTable
    ..While D@CurSize > 0

        mov ecx D@CurSize | shr ecx 1 | mov D@half ecx ; half = cursize/2
        mov ecx D@Low | add ecx D@half | mov D@Mid ecx | mov eax D$esi+ecx*4+4 | mov D@Probe eax

        mov eax D@Value
        If eax = D@Probe
            mov eax D@Mid; | inc eax . Inc return the actual Pos (from 1 to XX) and not the index from 0 to XX
            shr eax 1
            ExitP
        Else_If eax > D@Probe
            mov eax D@Mid
        Else
            mov eax D@low
        End_If
        mov D@Low eax

        mov eax D@half | mov D@CurSize eax

    ..End_While

EndP


The cmovge is a replacement to the routines in the Else_If macro.  On this example, we are computing the exact value. If the value on the table is > MYValue we use Mid as the checking value. Otherwise we use the previous "low". So we exchange the value according to the condition on the jmp (Greater then on this case).

I never used cmovge before. But it seems to work as expected. This algorithm i ported and adapted from:
https://academy.realm.io/posts/how-we-beat-cpp-stl-binary-search

I´ll try to fix to it handles  the cases of a value to be found in between the previous and the next one. Probably will need another cmovge or cmovle. I[´ll see if i can fix.

So far, the currrent version is working here for exact match only.
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

jj2007

  • Member
  • *****
  • Posts: 9686
  • Assembler is fun ;-)
    • MasmBasic
Re: Fast Compare Real8 with SSE
« Reply #22 on: February 01, 2019, 11:48:35 AM »
Sorry, but I can't get it to work, Guga. In the meantime, I have designed a binary search algo, using the hiwords of the REAL8s. It is about twice as fast as repne scasd:
Code: [Select]
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz
1750000 random searches performed in 191 ms - repnz scasd
1750000 random searches performed in 97 ms
1750000 random searches performed in 95 ms
1750000 random searches performed in 95 ms
1750000 random searches performed in 95 ms
1750000 random searches performed in 95 ms
1750000 random searches performed in 95 ms
1750000 random searches performed in 96 ms
1750000 random searches performed in 96 ms
1750000 random searches performed in 95 ms
1750000 random searches performed in 93 ms

guga

  • Member
  • *****
  • Posts: 1043
  • Assembly is a state of art.
    • RosAsm
Re: Fast Compare Real8 with SSE
« Reply #23 on: February 02, 2019, 12:21:45 AM »
Tks a lot, JJ :t :t :t

Your version seems to work as expected including the values in between the limits :)

I tried to port the cmovge to masmbasic, but it seems to contains lots of errors while retrieving the values. A mismatch is being caused by a error of 1 or 2 missed values. So, when it was supposed to return 182 it returned 181 etc. Also your version seems to be faster. :t :t :t  But, if you want to try, this is the ported code. (Have lots of mismatches)

Code: [Select]
BinSearchP proc uses esi edi ebx edx ecx pTable, Val8:REAL8, numel
  mov edi, numel ; eax = size
  dec edi
  xor eax, eax
  shl edi, 1
  mov esi, pTable ; left

NotDone:

shr edi, 1
lea ecx, [edi+eax]
mov ebx, [esi+8*ecx+4]
cmp dword ptr Val8[4], ebx
cmovge eax, ecx

  test edi, edi
jne NotDone

cmp eax, dword ptr Val8[4]
jnae Exiting
     dec eax
Exiting:
  ret
BinSearchP endp

I gave a try and inserted your version on my CieLCH routine for colorize images and the result seems accurate. It do occurs some clippings on the Chroma yet, but it is most likely due to the limitation of the ratios between hue and chroma for a given Luminosity. I´ll see how to fix that later.



The binary search routine you did works like a charm. Now, i´ll need to figure it out how to optimize a routine for sin, cosine and atan2 (arctang using atan2 function) computations. Optimizing those trigonometry functions willl make the convertion between RGBtoCieLCh/CieLChtoRGB works, at least, 2 to 4x faster then what it is actually. :icon_cool: :icon_cool: :icon_cool:

If you have somee examples of faster (and accurate at the same time) sin, cossine and atan2 fucntions, please post it so i can take a look and see if i can adapt to it those color convertion routines 8) 8) 8)
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

jj2007

  • Member
  • *****
  • Posts: 9686
  • Assembler is fun ;-)
    • MasmBasic
Re: Fast Compare Real8 with SSE
« Reply #24 on: February 02, 2019, 01:39:46 AM »
Thanks, Guga. There seems to be a little bug in your algo, but right now I have no time to chase it. But I attach a version that has both algos implemented. You just need to change the string on top:

BinSearch$ equ BinSearchJ       ; use BinSearchG to test Guga's version

My version is not foolproof yet, either: search the source for seed, there is one that produces in pos 5/6 two very close numbers. Since the algo checks only the high DWORD of the REAL8, the comparison depends on a handful of bits of the mantissa only. That works most of the time, and it works always if the table is a known one, but for a random table it may create mismatches. It may not even be a problem, though. I would have to see concrete applications of such an algo. It is indeed fast, maybe that's the only thing that counts.

guga

  • Member
  • *****
  • Posts: 1043
  • Assembly is a state of art.
    • RosAsm
Re: Fast Compare Real8 with SSE
« Reply #25 on: February 02, 2019, 03:52:37 AM »
Tks JJ.

The bugs in mine version are probably because i forgot to include a routine to check for  numbers outside the range of the table as you did with:
Code: [Select]
  cmp ebx, [esi] ; <left?
  js @err
  cmp ebx, [edx] ; >right?
  jg @err

But, i was amazed it was also fast (despite the errors).

Quote
My version is not foolproof yet, either: search the source for seed, there is one that produces in pos 5/6 two very close numbers.
Maybe another final check for the Loword of the HIDWORD should make the trick to fix. It would work as the original verson but immediatelly after

Code: [Select]
  sub edx, pTable ; middle pos - original left
  sar edx, 3
  xchg eax, edx

So, it could compare the index value found at eax, with the LOWORD of the HIDWORD of the next value (or previous one). If it also matches, then we have found the proper values.


Quote
I would have to see concrete applications of such an algo. It is indeed fast, maybe that's the only thing that counts.

Well...i can think on a "few" pratical applications where this couldd be used :icon_mrgreen: :icon_mrgreen: :icon_mrgreen:  A faster binary search (for Real8 on this particular case and this situations of scanning on a ordered list) can be of extreme use for video, image and audio processing.  On the functions i´m creating for image processing, such algorithm is a must, speciaqlly because colorpaces functions makes heavy usage of complex mathematical equations that could be easily replaced by simple pointers to tables  and here iss where a faster binary search can be used.

For example, on my original version of CIELCHtoRGB, the part that actually converts the Value (In KFactor) to Red, Green or Blue colors are given by the following formula:

FinalColor = ((TempColor+0.055)/1.055)^2.4

The final color is computed also after checking for the threshold as i mentioned earlier


KFactor = [ColorNormalized+Offset)/(1+Offset))^Gamma]  * 100 ; If ColorNormalized (Color/255) is bigger then Treshold
KFactor = (ColorNormalized/Slope)  * 100 ; If ColorNormalized (Color/255) is smaller or equal to Treshold

On my original version i used the same way people often uses to calculate this stuff. So i inserted the formula on the final part of the color convertions routines such as:

Code: [Select]
Proc CieLCHtoRGB:
    Arguments @pLuminance, @pChroma, @pHue, @Red, @Green, @Blue, @Flag, @WhiteRef
    Structure @TempStorage 64, @pXDis 0, @pYDis 8, @pZDis 16, @TmpRedDis, 24, @TmpGreenDis 32, @TmpBlueDis 40, @pAFactorDis 48, @pBFactorDis 56
    Uses esi, ebx, ecx, edx, edi

    finit

(.....)

    lea ecx D@TmpRedDis | fld R@pXDis | fmul R$esi+FloatMatrices.M1Dis | fld R@pYDis | fmul R$esi+FloatMatrices.M2Dis | faddp ST1 ST0 | fld R@pZDis | fmul R$esi+FloatMatrices.M3Dis | faddp ST1 ST0 | fstp R$ecx
    lea ecx D@TmpGreenDis | fld R@pXDis | fmul R$esi+FloatMatrices.M4Dis | fld R@pYDis | fmul R$esi+FloatMatrices.M5Dis | faddp ST1 ST0 | fld R@pZDis | fmul R$esi+FloatMatrices.M6Dis | faddp ST1 ST0 | fstp R$ecx
    lea ecx D@TmpBlueDis | fld R@pXDis | fmul R$esi+FloatMatrices.M7Dis | fld R@pYDis | fmul R$esi+FloatMatrices.M8Dis | faddp ST1 ST0 | fld R@pZDis | fmul R$esi+FloatMatrices.M9Dis | faddp ST1 ST0 | fstp R$ecx

    call GammaLinearDecodingEx F_gamma, F_Offset, F_Slope, F_Treshold, F_OffsetPlusOne, D@Flag

    lea ecx D@TmpRedDis
    .Fpu_If R$ecx > R$F_Treshold
        ;Color = ((Color+0.055)/1.055)^2.4
        fld R$F_gamma | fld R$ecx | fyl2x | fld1 | fld ST1 | fprem | f2xm1 | faddp ST1 ST0 | fscale | fxch | fstp ST0
        fmul R$F_OffsetPlusOne | fsub R$F_Offset
    .Fpu_Else
        fld R$ecx | fmul R$F_Slope
    .Fpu_End_If
    mov ecx D@Red | fmul R$Float255 | fistp F$ecx

    lea ecx D@TmpGreenDis
    .Fpu_If R$ecx > R$F_Treshold
        ;Color = ((Color+0.055)/1.055)^2.4
        fld R$F_gamma | fld R$ecx | fyl2x | fld1 | fld ST1 | fprem | f2xm1 | faddp ST1 ST0 | fscale | fxch | fstp ST0
        fmul R$F_OffsetPlusOne | fsub R$F_Offset
    .Fpu_Else
        fld R$ecx | fmul R$F_Slope
    .Fpu_End_If
    mov ecx D@Green | fmul R$Float255 | fistp F$ecx

    lea ecx D@TmpBlueDis
    .Fpu_If R$ecx > R$F_Treshold
        ;Color = ((Color+0.055)/1.055)^2.4
        fld R$F_gamma | fld R$ecx | fyl2x | fld1 | fld ST1 | fprem | f2xm1 | faddp ST1 ST0 | fscale | fxch | fstp ST0
        fmul R$F_OffsetPlusOne | fsub R$F_Offset
    .Fpu_Else
        fld R$ecx | fmul R$F_Slope
    .Fpu_End_If
    mov ecx D@Blue | fmul R$Float255 | fistp F$ecx

    ; clip values outside the range
    mov ecx D@Red
    If D$ecx <s 0
        mov D$ecx 0
    Else_If D$ecx > 255
        mov D$ecx 255
    End_If

    mov ecx D@Green
    If D$ecx <s 0
        mov D$ecx 0
    Else_If D$ecx > 255
        mov D$ecx 255
    End_If

    mov ecx D@Blue
    If D$ecx <s 0
        mov D$ecx 0
    Else_If D$ecx > 255
        mov D$ecx 255
    End_If


EndP

See the problem ? To achieve the final color we need to do a bunch of mathematical operations using power functions to retrieve the proper value.

What i did was revert the formula and use them as a table that is pre-calculated way before CieLCHtoRGB is actually running. I found ouyt that the resultant values that generates the color have  fixed fractions (i named them as Kfactor). And the total amount of fractions are only 256 ! So, one fraction per color.

All i had to do is create a routine to calculate all those values and insert them on a huge structure, which i named as "WSMatrix" (So far, the size of this structure is 4388 bytes, but i´ll probably increase it because i need 2 or 3 more tables). I created a function called SetupWorkSpaceMatrixDataEx that can be used when the application starts, for example or under user choice only once.

The function SetupWorkSpaceMatrixDataEx pre-calculate literraly everything needed to convert from RGB to CieLCH and vice-versa, inserting all data (gamma, offset, white references, creating new matrices, calculating multiplicand fractions, stablishing limits etc etc) and put all of that on a single structure WSMatrix to be used internally.

The usage of a table and the pre-calculation of all those values is a must, specially because whenever i´m analysing a image/video etc, i no longer need to make all those computations for every pixel. All it is needed is for CieLCHtoRGB convertion is basically points to the precalculated data from WSMatrix and do simple math operations (if needed).

It´s a major advantage in terms of speed computation because we are no longer calculating everything for each pixel. For example, all those monster computationss i replaced with a simple :

Code: [Select]
Proc CieLCHtoRGB_Ex:
    Arguments @pLuminance, @pChroma, @pHue, @Red, @Green, @Blue, @pMatrix

    finit
    mov edx D@pChroma
    lea edi D@pAFactorDis | mov esi D@pHue | fld R$esi | fmul R$Degree_Radian | fcos | fmul R$edx | fstp R$edi
    lea edi D@pBFactorDis | mov esi D@pHue | fld R$esi | fmul R$Degree_Radian | fsin | fmul R$edx | fstp R$edi

(...)

    lea ecx D@TmpRedDis | fld R@pXDis | fmul R$esi+WS_Matrix.Inverted.Red_M1Dis | fld R@pYDis | fmul R$esi+WS_Matrix.Inverted.Green_M2Dis | faddp ST1 ST0 | fld R@pZDis | fmul R$esi+WS_Matrix.Inverted.Blue_M3Dis | faddp ST1 ST0 | fstp R$ecx
    lea ecx D@TmpGreenDis | fld R@pXDis | fmul R$esi+WS_Matrix.Inverted.Red_M4Dis | fld R@pYDis | fmul R$esi+WS_Matrix.Inverted.Green_M5Dis | faddp ST1 ST0 | fld R@pZDis | fmul R$esi+WS_Matrix.Inverted.Blue_M6Dis | faddp ST1 ST0 | fstp R$ecx
    lea ecx D@TmpBlueDis | fld R@pXDis | fmul R$esi+WS_Matrix.Inverted.Red_M7Dis | fld R@pYDis | fmul R$esi+WS_Matrix.Inverted.Green_M8Dis | faddp ST1 ST0 | fld R@pZDis | fmul R$esi+WS_Matrix.Inverted.Blue_M9Dis | faddp ST1 ST0 | fstp R$ecx

    lea ebx D$esi+WS_Matrix.KFactorMapDis

    mov edi D@Red
    fld R@TmpRedDis | fmul R$Float100 | fstp R@TmpColorCheckDis
    lea eax D@TmpColorCheckDis
    call BinarySearch ebx, D$eax+4, 256
    mov D$edi eax

    mov edi D@Green
    fld R@TmpGreenDis | fmul R$Float100 | fstp R@TmpColorCheckDis
    lea eax D@TmpColorCheckDis
    call BinarySearch ebx, D$eax+4, 256
    mov D$edi eax

    mov edi D@Blue
    fld R@TmpBlueDis | fmul R$Float100 | fstp R@TmpColorCheckDis
    lea eax D@TmpColorCheckDis
    call BinarySearch ebx, D$eax+4, 256
    mov D$edi eax

EndP


I can also gain a bit more speed using the BinarySearch function inline. But i choose to use as a call to a function to see first, if it was working  (and it is  :greenclp: :greenclp: :greenclp: :greenclp: )


Didn´t even need to benchmark, since the difference in speed when using binarysearch rather then calculating "Color = ((Color+0.055)/1.055)^2.4" all the time, is noticed on naked eyes :icon_mrgreen: :icon_mrgreen: :icon_mrgreen:


If you have time and can fix the binarysearch for foolproof it will be very handy.

Also, i´ll try to see a way to optimize the sin, cosinee and atan2 computations as well. Since both convertions RGB to CieLCH and it´s reversal) makes heavy usage of trigonometry operations like those, an optimizations is also needed to speed things even more.

Just to you have a small idea on how the optimization can make things better. When i started all of those convertion routines, the images i´m posting here to transfom Gray to Color took something around 40 minutes to finish processing. After using the tables technique, the convertion was done in 4 to 8 seconds (Including the time to pre-calculate all of this). Now...with your´s optimization and the fixes i made on the convertion routines, it is taking something around 2 to 4 seconds including the precalculation routines to colorize a 640*480 grayscale image using a 960*720 reference color image.


This is to say that with further optimizations, i can reduce this total time of less then 1 second including the precalculations. So, since the precalculations are made when the app starts for example, it means that the total amount of time of coloring a image can be done in a few miliseconds.

Also, once i suceed to make the colorization method output a file formed by a kind of sample structure, then all of this can be done in almost notime. All is needed is retrieve the pointers from an external file, make the necessary adjustements to those pointers, and voialáa, you can convert a grayscale image onto a colored one almost immediatelly.

If i can be able to do this with image, the same process can be applied to video processing. For a video with let´s say 50.000 frames, if to colorize each frame it takes, let´s say 100 miliseconds per frame, it means that the video will be totally colorized in something around 15 minutes rather then years since the original version that took around 45 minutes to colorize 1 single image :greensml: :greensml: :greensml: :greensml:
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

jj2007

  • Member
  • *****
  • Posts: 9686
  • Assembler is fun ;-)
    • MasmBasic
Re: Fast Compare Real8 with SSE
« Reply #26 on: February 02, 2019, 04:26:11 AM »
The bugs in mine version are probably because i forgot to include a routine to check for  numbers outside the range of the table
No, that's not a problem since we only feed valid numbers. The check can be omitted.

Quote
So, it could compare the index value found at eax, with the LOWORD of the HIDWORD of the next value (or previous one). If it also matches, then we have found the proper values.
Yes, something like that. But in most cases it's not necessary, since the numbers will be "distant" enough anyway. We are checking the high DWORD, i.e. the one that contains the exponent and part of the mantissa:

That "part" is 32-11-1=20 bits, or about 6-7 digits. Should be sufficient in most cases ;-)

daydreamer

  • Member
  • ****
  • Posts: 907
  • watch Chebyshev on the backside of the Moon
Re: Fast Compare Real8 with SSE
« Reply #27 on: February 03, 2019, 02:54:13 AM »
good work
I think there is a fast chebyshev on the board and maybe way of unroll to search r,g,b at once?
Quote from Flashdance
Nick  :  When you give up your dream, you die
*wears a flameproof asbestos suit*
Gone serverside programming p:  :D

guga

  • Member
  • *****
  • Posts: 1043
  • Assembly is a state of art.
    • RosAsm
Re: Fast Compare Real8 with SSE
« Reply #28 on: February 03, 2019, 04:16:32 AM »
Tks Daydreamer. :t :t

I´ll take a look at the chebyshev on the board and see if the adaptation to the functions works ok.

About a way to unroll to search for rgb at once...well..it could be done but there would lead to some problems  and one of them is a conceptual one.

1s) Creating such a method would requires at least one huge table. Each table should have something around 398 Mb. Since the range of RGB is 0FFFFFF we will searching for 16581375 pixels. So, 16581375*8 (size of each Real8) * 3 (3 elements, one for Luminance, other for Chroma and other for hue) we will end up on a table with almost 398 Mb to search within. If we need 2 more tables it will something around 800 Mb and so on.

2nd) The conceptual problem is, actually, the harder one. All those formulas uses fractions to work with and exponential/trigonometry/power math functions to compute the values of each element (Luminance, hue, chromas, etc). The main problem is that,from all those 16 millions colors a huge amount of them are indistinguishable between each other. For example, RGB = 199, 255, 251 produces practically the same result (in chroma/hue/luminance) as 198, 254, 250 or 197, 255, 251 or 196, 255, 251. The human eye can´t identify which color is which. So, on a conceptual point of view all of those values represents the exact same color. So it would be hard to choose which color combination to use to convert back (From CieLCh/CieLab/XYZ to RGB) or simply by making the pointer to a table which will then points to another location.

What i found out on those perpcetual colorspaces is that they all claims that Luminance is completelly isolated from Chroma or Hue but, in fact, on their formulas Luma influences chroma and hue (although on a minor percentage). I also found a meaning of what is actually a gray color. Analysing the behaviour of luminance i found out that what we call gray is nothing more then a range of luminance spread in 256 pieces.

So, what will determine which gray color a RGB combination will have is basically the luminance.

So, for example, no matter what are the RGB combination used or what is the chroma or hue etc, whenever the luminance have a range in between
0 to a value smaller then 2.741066938704112e-1 it always will return in 0 as gray.

Therefore since gray represents the amount of luminance of a pixel, it also means that colors (on their isolated channels) are nothing more then the variation of luminance and thus, gray range. Thinking on that i came up with a table to be used to convert from luma to gray (according to the chosen model: Adfobe RGB 1998, sRGB, HDTV etc etc). It is something like this:

    Gray  Luminosity (Min)
    0       0
    1       2.741066938704112e-1
    2       5.48213387740825841e-1
    3       8.22320081611237263e-1
    4       1.0964267754816519
    5       1.37053346935206344
    6       1.64464016322247808
    7       1.91874685709288939
(...)
    253     99.309586872082832
    254     99.6549222327689391
    255     100

    So, the range of gray can be interpreted as:
   
    Color   Luminosity range
    0       0 to < 2.741066938704112e-1
    1       >= 2.741066938704112e-1 and < 5.48213387740825841e-1
    2       >= 5.48213387740825841e-1 and < 8.22320081611237263e-1
    ...
    254     >= 99.6549222327689391 and < 100
    255     = 100


This lead me to another problem using the "common" formulas for RGBtoCieLab/RGBtoLCH. Assuming that we are working with tiny ranges of gray, i tried to make the RGBtoCieLab/RGBtoLCH function fixes the range to it´s minimum. So, for example, if a color combination results on a luminance of 6.7e-1, it means that this color luminance is related to the color 2 (Gray = 2), thus i´ll need to decrease the luminance value to it fit´s the minimum of 5.48213387740825841e-1.

But, then a problem came up. When i do this, the resultant values are converted accordly, so it reduces the chroma, but....since the formula also relates Y (From XYZ) to luminance and Y is directly related to the R,G,B pixels, if i reduced the value of luminance i´ll need at end to proportionally reduce the value of R,G,B individually.

The main problem is that when i do the convertion backwards, it will also reduce the value to a new RGB range. For example, inputing RGB = 199, 255, 251, it is in fact this combination: RGB = 198, 254, 250 (if the reduction is proportional) or RGB = 196, 255, 251 (if the reduction is biased on Chroma only).

But...when i convert back those reducted values 198, 254, 250 or 196, 255, 251 they will keep the proportion and reduce again decreasing each pixel value by 1.

This is because the current formulas used to compute the perceptual colorspaces works with Fractions to compute pixels whose values are integer (There´s no such a thing as a red = 154.56564...The value is simply 154 (or 155 according to  the used rounding mode).

To fix that i´ll have to find the limits in between Chroma, Hue and luminance, regardless their claims (false ones) that they are completely separated things.

The correct equation to compute all of that crap is this something like this.

Z = (Chroma*sin(hue)/200) + (Luma+16)/116

Where Z is the Z element of the XYZ colorspace.

But...Z is directly related to R,G,B pixels after multiplying by a matrix, in the form of: Red*Matrix7+Green*Matrix8+Blue*Matrix9

If i reduce proportionally, i´ll necessarily assume that Red, Green and Blue must be proportionally reduced as well....Or one of them will be readjusted to fits to the Z value. In any case, the resultant chroma and hue will be altered on a way that the backwards computation will be impossible because one of the channels will always be reduced on each time we use the function, due to the problem of using integers to compute fraction-ed elements and trying to do the backwards computation where there no such a thing as a fraction-ed red, green or blue.

Not sure if i´m being clear about all of this. The bottom line is that those convertion algorithms suchs  :icon_mrgreen: :icon_mrgreen: :icon_mrgreen: :icon_mrgreen:

I´m pretty sure that i´ll need to find some limits to work on those things and end to build 2 or 3 more tables. The problem is find them for each color range. :redface: :redface:
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

daydreamer

  • Member
  • ****
  • Posts: 907
  • watch Chebyshev on the backside of the Moon
Re: Fast Compare Real8 with SSE
« Reply #29 on: February 03, 2019, 05:06:55 AM »
sounds very complicated

question,if I want to tween between two pixels with colorspace HSV or CieLCH ,or any colorspace with Hue,maybe I need some IF to check direction,so I dont end up with hue pixel1=5 degrees,hue pixel2=350 degrees,tweens between 5 and 350 degrees,when it should tween for example 5degrees,2.5,0,357.5,355,352.5,350 instead?
Quote from Flashdance
Nick  :  When you give up your dream, you die
*wears a flameproof asbestos suit*
Gone serverside programming p:  :D