Author Topic: Fast Compare Real8 with SSE and ColorSpaces  (Read 12586 times)

jj2007

• Member
• Posts: 10556
• Assembler is fun ;-)
Re: Fast Compare Real8 with SSE
« Reply #15 on: January 30, 2019, 10:30:24 PM »
given a table of 256 Real8 values from 0 to 100 (And all in crescent order from 0 to 100 only)

I've given it a try for the naive scan method:
Code: [Select]
`Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz10000000 random searches performed in 1157 ms`
The source:

Init
Dim table() As REAL8          ; create an array
Let edi=New\$(256*DWORD)       ; and a DWORD table
push edi                      ; we need it later
Rand()
For_ ecx=0 To 255
Rand(0, 100, table(ecx))
Next
ArraySort table()             ; sort double array ascending
Open "O", #1, "TheTable.dat"  ; save the array to disk
ArrayStore #1, table()
Close
For_ ecx=0 To 255
movlps xmm0, table(ecx) ; get a value from the table
.if ecx<5 || ecx>=255-4
.if Zero?
PrintLine "..."
.endif
Print Str\$(ecx), Str\$("\t%9f\n", f:xmm0)        ; print the first and last values
.endif
movd eax, xmm0
stosd                   ; store the double's low DWORD
Next
pop edi
loops=10000000
push loops-1
PrintCpu 0
NanoTimer()
.Repeat
mov ecx, 256
mov ebx, Rand(ecx)      ; test a random search position
mov eax, DWORD PTR table(ebx)
push edi
push ecx
repnz scasd             ; this is the search algo ;-)
pop edx
sub ecx, edx
not ecx
pop edi
.if ecx!=ebx            ; check if the found position is the expected one
Print Str\$("\n## Mismatch at pos %i", ecx), Str\$("<>%i ##\n\n", ebx)
.Break
.endif
dec stack
.Until Sign?
Print Str\$("%i random searches performed in ", loops), NanoTimer\$()
EndOfCode

The binary search could be faster but also trickier. You would have to compare the HIGH dwords, and they are much more likely to have duplicates.

guga

• Member
• Posts: 1287
• Assembly is a state of art.
Re: Fast Compare Real8 with SSE
« Reply #16 on: January 31, 2019, 06:34:41 PM »
Tks a lot, Jochen and AW.

JJ, I´ll give more tests but it seems that your idea on using the low dword worked. I also came up with a routine biased on your´s and on AW, can you please benchmark this for me and see if it is working ?

Compare Double fpu (the low dword, in fact).
Code: [Select]
`Proc FastCompare4_Double:    Arguments @pTable, @Value, @Size    Uses esi, ebx, ecx, edi        mov edi D@Size        xor eax eax        shl edi 1 ; always mul by 2 at the start. A Real8 is double the size of a dword.        mov esi D@pTableL1:        ; half = edi        shr edi 1        lea ecx D\$edi+eax | mov ebx D\$esi+ecx*4+4; Otherlow = low+half        If D@Value = ebx            mov eax ecx ; Commented out the inc eax . Inc return the actual Pos (from 1 to XX) and not the index from 0 to XX            shr eax 1            ExitP        End_If        cmovg eax ecx               test edi edi | jne L1<EndP___________________________________Example of usage:[TestingTable1: D\$ 0, 0TestingTable2: D\$ 0, 1TestingTable3: D\$ 0, 2TestingTable4: D\$ 0, 3TestingTable5: D\$ 0, 4TestingTable6: D\$ 0, 5TestingTable7: D\$ 0, 6TestingTable8: D\$ 0, 7TestingTable9: D\$ 0, 8....TestingTable255: D\$ 0, 254TestingTable256: D\$ 0, 255]call FastCompare4_Double TestingTable1, 177, 256`
Compare Dwords only

Code: [Select]
`Proc FastCompare4:    Arguments @pTable, @Value, @Size    Uses esi, ebx, ecx, edi    mov esi D@pTable    xor eax eax ; eax = low     mov edi D@SizeL1:        ; half = edi        shr edi 1        lea ecx D\$edi+eax | mov ebx D\$esi+ecx*4; Otherlow = low+half        If D@Value = ebx            mov eax ecx; Commented out the inc eax . Inc return the actual Pos (from 1 to XX) and not the index from 0 to XX            ExitP        End_If        cmovg eax ecx        test edi edi | jne L1<EndP_______________________________________Example of usage:[MyGugaTestTable1: D\$ 0MyGugaTestTable2: D\$ 1MyGugaTestTable3: D\$ 2MyGugaTestTable4: D\$ 3MyGugaTestTable5: D\$ 4MyGugaTestTable6: D\$ 5MyGugaTestTable7: D\$ 6MyGugaTestTable8: D\$ 7MyGugaTestTable9: D\$ 8(...)MyGugaTestTable254: D\$ 253MyGugaTestTable255: D\$ 254MyGugaTestTable256: D\$ 255]call FastCompare4 MyGugaTestTable1, 177, 256`

I hope that those functions are faster then the logarithm computation used on the original function :)
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

jj2007

• Member
• Posts: 10556
• Assembler is fun ;-)
Re: Fast Compare Real8 with SSE
« Reply #17 on: January 31, 2019, 06:47:04 PM »

I'm afraid my assembler won't accept the syntax... do you have a translator?

Btw that table of REAL8s - is that always the same numbers, or are they randomly created? If they are the same, testing once if there are duplicates would be sufficient. If not, there is a hypothetical chance to find a duplicate, and that would require one extra run.

guga

• Member
• Posts: 1287
• Assembly is a state of art.
Re: Fast Compare Real8 with SSE
« Reply #18 on: January 31, 2019, 11:19:59 PM »
Hi JJ. Tks :)

I believe the syntax is like this

Code: [Select]
`pTable = Pointer to the table of 256 Real8Value = The value to be found. Any IntegerSize = Integer. Size of the table. Dummy, i know...since we are testing only 256 Real8, but i did like that to remember later of what i was doing :) FastCompare4_Double proc pTable: Dword, Value:Dword, Size:Dword                push    ebp               mov     ebp, esp                push    esi                push    ebx                push    ecx                push    edi                mov     edi, Size ; The size of a Real8 is always 2x the size of an integer. So, we need to double it´s size here so we can compare the value with only it´s low dword later on mov ebx [esi+ecx*4+4]                xor     eax, eax                shl     edi, 1                mov     esi, pTable ; or.. offset pTable (I forgot the masm syntax. But here is a pointer not the value itself.loc_45AEA1:                shr     edi, 1                lea     ecx, [eax+edi]                mov     ebx, [esi+ecx*4+4]                cmp     Value, ebx ; The valuye to be found is compared to the Low Dword from our Real8 table                jnz     short loc_45AEB8                mov     eax, ecx                shr     eax, 1 ; Note: We are retrieving only the index and not the actual pos. For Real8 computations, divide the result by 2 to it retrieve the proper index. To retrieve the actual pos simply increase this result by 1                jmp     loc_45AEC4 ; Since we found the value we can exit the function :) loc_45AEB8:                cmovg   eax, ecx                test    edi, edi                jnz     short loc_45AEA1                mov     eax, 0FFFFFFFFh ; Nothing was found. Return -1loc_45AEC4:                pop     edi                pop     ecx                pop     ebx                pop     esi                mov     esp, ebp                pop     ebp                retn    0ChFastCompare4_Double endp; ---------------------------------------------------------------------------`

About the Real8 values...they are all the same. They are created from a computation of gamma and offset, whoose values are fixed for every color model, with the exception of only few models (sRGB HDTV and Adobe RGB 1998) that uses these, respectivelly: Gamma: 2.19921875, Offset: 0.055 and Gamma: 2.2, Offset: 0.099). All other models uses the same values for gamma and offset (2.2 for gammaa and 0.055 for offset. Other few models uses values as 1.8 for gamma).

The resultant values of Real8 (i named them as "Kfactor") on the table are given by the following equation:

ColorNormnalized = Color/255 ; The table of 256 Real8 is created using a range of integers from Color = 0 to 255. I mean, i created a function and insert integer values (not normalized) from 0 to 255 to it compute the Kfactor values for each one of the 256 colors.

KFactor = [ColorNormalized+Offset)/(1+Offset))^Gamma]  * 100 ; If ColorNormalized (Color/255) is bigger then Treshold
KFactor = (ColorNormalized/Slope)  * 100 ; If ColorNormalized (Color/255) is smaller or equal to Treshold

Threshold  and slope are calculated from Gamma and offset like this:

Threshold  = Offset/(Gamma-1)
Slope = [Offset * ( (Gamma-1)^(Gamma-1) )] * [(Offset+1)/(Offset*Gamma)]^Gamma

The only fixed values are, offset and gamma that the user must insert to create threshold and slope for the KFactor computation.
« Last Edit: February 01, 2019, 02:08:23 AM by guga »
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

guga

• Member
• Posts: 1287
• Assembly is a state of art.
Re: Fast Compare Real8 with SSE
« Reply #19 on: February 01, 2019, 01:02:41 AM »
Note:

The values of gamma varies from 1.1 to 2.4
Offset is a tiny value. 0.055 (or 0.099 for HDTV only). The values seems to be x/1000 where x >0 < 100, but i´m not sure how they are achieved. I stablished the limits for gamma to avoid division by zero. But for offset i´ll take a look later. In cases where offset are 0 the correct would result in a slope of 1, if i assume that 0/0 = 1. But i don´t remember if for gamma and offset it do exists a formula to retrieve those values or they are only made by the user/manufacturer of a monitor for example.

The formulas can be found here:
https://en.wikipedia.org/wiki/SRGB
http://www.marcelpatek.com/color.html
http://hem.bredband.net/egostudios/gamma_management.htm
http://www.marcelpatek.com/gamma.html
https://web.archive.org/web/20130718042406/http://www.arcsynthesis.org/gltut/Illumination/Tut12%20Monitors%20and%20Gamma.html

Marcel Patek´s  formulas seems incorrect as mentioned earlier.
« Last Edit: February 01, 2019, 02:05:58 AM by guga »
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

jj2007

• Member
• Posts: 10556
• Assembler is fun ;-)
Re: Fast Compare Real8 with SSE
« Reply #20 on: February 01, 2019, 03:03:44 AM »
Hi Guga,

I've tried my luck but your algo returns always -1 (and you haven't been generous with comments). This is supposed to be binary search, right? What's wrong? Does it work on your end?

The parameters are passed correctly:
Code: [Select]
` mov edi, _Size ; 100h, ok xor eax, eax mov esi, pTable ; OK ; ^ ^ at this point: edi=100h, esi points to REAL8 values ^ ^`
Code: [Select]
`option prologue:noneoption epilogue:noneif 0  pTable = Pointer to the table of 256 Real8  Value = The value to be found. Any Integer  Size = Integer. Size of the table. Dummy, i know...since we are testing only 256 Real8, but i did like that to remember later of what i was doing :)endif;   invoke FastCompare4_Double, addr table(0), DWORD PTR table(ebx), ecxFastCompare4_Double proc pTable: Dword, Value:Dword, _Size:Dword push ebp mov ebp, esp push esi push ebx push ecx push edi mov edi, _Size ; 100h, ok xor eax, eax shl edi, 1 mov esi, pTable ; OK or.. offset pTable (I forgot the masm syntax. But here is a pointer not the value itself.loc_45AEA1: shr edi, 1 lea ecx, [eax+edi] mov ebx, [esi+ecx*4+4] cmp Value, ebx ; or.. offset Value (I forgot the masm syntax. But here is a pointer not the value itself. jnz short loc_45AEB8 mov eax, ecx shr eax, 1 jmp loc_45AEC4loc_45AEB8: cmovg eax, ecx test edi, edi jnz short loc_45AEA1 mov eax, 0FFFFFFFFhloc_45AEC4: pop edi pop ecx pop ebx pop esi mov esp, ebp pop ebp retn 0ChFastCompare4_Double endpOPTION PROLOGUE:PrologueDefOPTION PROLOGUE:PrologueDef`

guga

• Member
• Posts: 1287
• Assembly is a state of art.
Re: Fast Compare Real8 with SSE
« Reply #21 on: February 01, 2019, 04:35:55 AM »
Oops..i forgot to reinsert the comments. Sorry  :redface: :redface: :redface: :redface:I removed them after optimizing. Here is the non optimized version with comment.

Here both are working, but only when it finds a match of the exact value. When it tries to find a value in between 2 it returns also -1

For instance, when the table is
[0, 5, 9, 12]
If the value to be found is 7, it will return -1.

I forgot to adapt the routine on the cases for the next value be bigger. Ex: 7 is in between 5 and 9. So it must return index 1

Here is the non optimized masm version. (For the exact match, i mean, and not the other to searching for values in between (i´m working on it, right now and see if i can be able to use the cmovge/cmovle as well to optimize:) )

Code: [Select]
`FastCompare3_Double proc nearProbe           = dword ptr -1Chmid             = dword ptr -14hLow             = dword ptr -10hhalf            = dword ptr -8CurSize         = dword ptr -4pTable          = dword ptr  8MyValue         = dword ptr  0ChSize            = dword ptr  10h                push    ebp                mov     ebp, esp                sub     esp, 1Ch                push    esi                push    ecx                push    edx                mov     [ebp+Low], 0 ; Start LowVariable = 0                mov     eax, [ebp+Size]                shl     eax, 1                mov     [ebp+CurSize], eax ; The Size always is mul by 2 because Real8 is 2x bigger then a Dword. So, current size starts with 512  (256*2)                mov     esi, [ebp+pTable]  ; Pointer to our table of real8loc_45AF7E:                             ; CODE XREF: FastCompare3_Double+6A↓j                cmp     [ebp+CurSize], 0  ; On each loop the current size is being decreased. Size is used as a counter.                jbe     loc_45AFCF                mov     ecx, [ebp+CurSize]                 shr     ecx, 1                mov     [ebp+half], ecx  ; half = cursize/2. half means searching for half of the table                mov     ecx, [ebp+Low]                add     ecx, [ebp+half]                mov     [ebp+mid], ecx ; Mid is what will return. Here it is Mid = Low+CurSize/2                mov     eax, [esi+ecx*4+4] ; esi will point to the middle of the table related to the half of the current position scnanned                mov     [ebp+Probe], eax   ; The value from the table to check                mov     eax, [ebp+MyValue] ; The inputed value                cmp     eax, [ebp+Probe]                jnz     short loc_45AFB4                mov     eax, [ebp+mid] ; If MyValue is equal to the one being checked, we found the match.                shr     eax, 1         ; since we are working with Real8, the original size was doubled. So we need to divide back by 2 to get the proper index. It will return the index. To return the pos, simply add 1 after the shr eax...                jmp     loc_45AFCF     ; Exit our loop; ---------------------------------------------------------------------------;                jmp     short loc_45AFC1 ; Left over by RosAsm Else_If macro. Can be commented out; ---------------------------------------------------------------------------loc_45AFB4:                             ; CODE XREF: FastCompare3_Double+46↑j                cmp     eax, [ebp+Probe] ; If the value from the table is bigger then My Value, it means we advanced the position. We lost he spot in forwards. Ex: MyValue = 5. Table ..... 2, 3, 5, 6 (6 = Probe). So Probe is forward the value to be found                jbe     short loc_45AFBE                mov     eax, [ebp+mid]    ; If we are forward it, get back to the previous position.                jmp     short loc_45AFC1; ---------------------------------------------------------------------------loc_45AFBE:                             ; CODE XREF: FastCompare3_Double+57↑j                mov     eax, [ebp+Low]  ; If we are here, it means that MyValue is located before the Probe. So, we are backwards. Ex:  MyValue = 5. Table ..... 2, 3 (Probe), 5, 6 . So Probe is before the value to be foundloc_45AFC1:                             ; CODE XREF: FastCompare3_Double+52↑j                                        ; FastCompare3_Double+5C↑j                mov     [ebp+Low], eax   ; Save the new position                mov     eax, [ebp+half]  ; Current size is always the half of what is beeing scanned. Perform the look if we are not over (So, if Curize is bigger then 0)                mov     [ebp+CurSize], eax                jmp     loc_45AF7E; ---------------------------------------------------------------------------loc_45AFCF:                             ; CODE XREF: FastCompare3_Double+22↑j                                        ; FastCompare3_Double+4D↑j                pop     edx                pop     ecx                pop     esi                mov     esp, ebp                pop     ebp                retn    0ChFastCompare3_Double endp`
To use the function you can do this:

Code: [Select]
`Myreal8Table    dd 3 dup(0), 1, 0, 2, 0, 3, 0, 4, 0, 5, 0, 6, 0, 7, 0dd 8, 0, 9, 0, 0Ah, 0, 0Bh, 0, 0Ch, 0, 0Dh, 0, 0Eh, 0dd 0Fh, 0, 10h, 0, 11h, 0, 12h, 0, 13h, 0, 14h, 0, 15hdd 0, 16h, 0, 17h, 0, 18h, 0, 19h, 0, 1Ah, 0, 1Bh, 0, 1Chdd 0, 1Dh, 0, 1Eh, 0, 1Fh, 0, 20h, 0, 21h, 0, 22h, 0, 23hdd 0, 24h, 0, 25h, 0, 26h, 0, 27h, 0, 28h, 0, 29h, 0, 2Ahdd 0, 2Bh, 0, 2Ch, 0, 2Dh, 0, 2Eh, 0, 2Fh, 0, 30h, 0, 31hdd 0, 32h, 0, 33h, 0, 34h, 0, 35h, 0, 36h, 0, 37h, 0, 38hdd 0, 39h, 0, 3Ah, 0, 3Bh, 0, 3Ch, 0, 3Dh, 0, 3Eh, 0, 3Fhdd 0, 40h, 0, 41h, 0, 42h, 0, 43h, 0, 44h, 0, 45h, 0, 46hdd 0, 47h, 0, 48h, 0, 49h, 0, 4Ah, 0, 4Bh, 0, 4Ch, 0, 4Dhdd 0, 4Eh, 0, 4Fh, 0, 50h, 0, 51h, 0, 52h, 0, 53h, 0, 54hdd 0, 55h, 0, 56h, 0, 57h, 0, 58h, 0, 59h, 0, 5Ah, 0, 5Bhdd 0, 5Ch, 0, 5Dh, 0, 5Eh, 0, 5Fh, 0, 60h, 0, 61h, 0, 62hdd 0, 63h, 0, 64h, 0, 65h, 0, 66h, 0, 67h, 0, 68h, 0, 69hdd 0, 6Ah, 0, 6Bh, 0, 6Ch, 0, 6Dh, 0, 6Eh, 0, 6Fh, 0, 70hdd 0, 71h, 0, 72h, 0, 73h, 0, 74h, 0, 75h, 0, 76h, 0, 77hdd 0, 78h, 0, 79h, 0, 7Ah, 0, 7Bh, 0, 7Ch, 0, 7Dh, 0, 7Ehdd 0, 7Fh, 0, 80h, 0, 81h, 0, 82h, 0, 83h, 0, 84h, 0, 85hdd 0, 86h, 0, 87h, 0, 88h, 0, 89h, 0, 8Ah, 0, 8Bh, 0, 8Chdd 0, 8Dh, 0, 8Eh, 0, 8Fh, 0, 90h, 0, 91h, 0, 92h, 0, 93hdd 0, 94h, 0, 95h, 0, 96h, 0, 97h, 0, 98h, 0, 99h, 0, 9Ahdd 0, 9Bh, 0, 9Ch, 0, 9Dh, 0, 9Eh, 0, 9Fh, 0, 0A0h, 0dd 0A1h, 0, 0A2h, 0, 0A3h, 0, 0A4h, 0, 0A5h, 0, 0A6h, 0dd 0A7h, 0, 0A8h, 0, 0A9h, 0, 0AAh, 0, 0ABh, 0, 0ACh, 0dd 0ADh, 0, 0AEh, 0, 0AFh, 0, 0B0h, 0, 0B1h, 0, 0B2h, 0dd 0B3h, 0, 0B4h, 0, 0B5h, 0, 0B6h, 0, 0B7h, 0, 0B8h, 0dd 0B9h, 0, 0BAh, 0, 0BBh, 0, 0BCh, 0, 0BDh, 0, 0BEh, 0dd 0BFh, 0, 0C0h, 0, 0C1h, 0, 0C2h, 0, 0C3h, 0, 0C4h, 0dd 0C5h, 0, 0C6h, 0, 0C7h, 0, 0C8h, 0, 0C9h, 0, 0CAh, 0dd 0CBh, 0, 0CCh, 0, 0CDh, 0, 0CEh, 0, 0CFh, 0, 0D0h, 0dd 0D1h, 0, 0D2h, 0, 0D3h, 0, 0D4h, 0, 0D5h, 0, 0D6h, 0dd 0D7h, 0, 0D8h, 0, 0D9h, 0, 0DAh, 0, 0DBh, 0, 0DCh, 0dd 0DDh, 0, 0DEh, 0, 0DFh, 0, 0E0h, 0, 0E1h, 0, 0E2h, 0dd 0E3h, 0, 0E4h, 0, 0E5h, 0, 0E6h, 0, 0E7h, 0, 0E8h, 0dd 0E9h, 0, 0EAh, 0, 0EBh, 0, 0ECh, 0, 0EDh, 0, 0EEh, 0dd 0EFh, 0, 0F0h, 0, 0F1h, 0, 0F2h, 0, 0F3h, 0, 0F4h, 0dd 0F5h, 0, 0F6h, 0, 0F7h, 0, 0F8h, 0, 0F9h, 0, 0FAh, 0dd 0FBh, 0, 0FCh, 0, 0FDh, 0, 0FEh, 0, 0FFh_________________________________________________________________push    256push    253push    offset Myreal8Tablecall    FastCompare3_Double`

Sorry for the assembly listing. I tried to port to masm, but i forgot the proper syntax, so i had to disassemble the function to you see.

In RosAsm the function looks like:

Code: [Select]
`Proc FastCompare3_Double:    Arguments @pTable, @Value, @Size    Local @CurSize, @half, @Other_Half, @Low, @mid, @Probe    Uses esi, ecx, edx    mov D@Low 0    mov eax D@Size | shl eax 1 | mov D@CurSize eax    mov esi D@pTable    ..While D@CurSize > 0        mov ecx D@CurSize | shr ecx 1 | mov D@half ecx ; half = cursize/2        mov ecx D@Low | add ecx D@half | mov D@Mid ecx | mov eax D\$esi+ecx*4+4 | mov D@Probe eax        mov eax D@Value        If eax = D@Probe            mov eax D@Mid; | inc eax . Inc return the actual Pos (from 1 to XX) and not the index from 0 to XX            shr eax 1            ExitP        Else_If eax > D@Probe            mov eax D@Mid        Else            mov eax D@low        End_If        mov D@Low eax        mov eax D@half | mov D@CurSize eax    ..End_WhileEndP`

The cmovge is a replacement to the routines in the Else_If macro.  On this example, we are computing the exact value. If the value on the table is > MYValue we use Mid as the checking value. Otherwise we use the previous "low". So we exchange the value according to the condition on the jmp (Greater then on this case).

I never used cmovge before. But it seems to work as expected. This algorithm i ported and adapted from:

I´ll try to fix to it handles  the cases of a value to be found in between the previous and the next one. Probably will need another cmovge or cmovle. I[´ll see if i can fix.

So far, the currrent version is working here for exact match only.
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

jj2007

• Member
• Posts: 10556
• Assembler is fun ;-)
Re: Fast Compare Real8 with SSE
« Reply #22 on: February 01, 2019, 11:48:35 AM »
Sorry, but I can't get it to work, Guga. In the meantime, I have designed a binary search algo, using the hiwords of the REAL8s. It is about twice as fast as repne scasd:
Code: [Select]
`Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz1750000 random searches performed in 191 ms - repnz scasd1750000 random searches performed in 97 ms1750000 random searches performed in 95 ms1750000 random searches performed in 95 ms1750000 random searches performed in 95 ms1750000 random searches performed in 95 ms1750000 random searches performed in 95 ms1750000 random searches performed in 96 ms1750000 random searches performed in 96 ms1750000 random searches performed in 95 ms1750000 random searches performed in 93 ms`

guga

• Member
• Posts: 1287
• Assembly is a state of art.
Re: Fast Compare Real8 with SSE
« Reply #23 on: February 02, 2019, 12:21:45 AM »
Tks a lot, JJ :t :t :t

Your version seems to work as expected including the values in between the limits :)

I tried to port the cmovge to masmbasic, but it seems to contains lots of errors while retrieving the values. A mismatch is being caused by a error of 1 or 2 missed values. So, when it was supposed to return 182 it returned 181 etc. Also your version seems to be faster. :t :t :t  But, if you want to try, this is the ported code. (Have lots of mismatches)

Code: [Select]
`BinSearchP proc uses esi edi ebx edx ecx pTable, Val8:REAL8, numel  mov edi, numel ; eax = size  dec edi  xor eax, eax  shl edi, 1  mov esi, pTable ; leftNotDone: shr edi, 1 lea ecx, [edi+eax] mov ebx, [esi+8*ecx+4] cmp dword ptr Val8[4], ebx cmovge eax, ecx  test edi, edi jne NotDone cmp eax, dword ptr Val8[4] jnae Exiting      dec eaxExiting:  retBinSearchP endp`
I gave a try and inserted your version on my CieLCH routine for colorize images and the result seems accurate. It do occurs some clippings on the Chroma yet, but it is most likely due to the limitation of the ratios between hue and chroma for a given Luminosity. I´ll see how to fix that later.

The binary search routine you did works like a charm. Now, i´ll need to figure it out how to optimize a routine for sin, cosine and atan2 (arctang using atan2 function) computations. Optimizing those trigonometry functions willl make the convertion between RGBtoCieLCh/CieLChtoRGB works, at least, 2 to 4x faster then what it is actually. :icon_cool: :icon_cool: :icon_cool:

If you have somee examples of faster (and accurate at the same time) sin, cossine and atan2 fucntions, please post it so i can take a look and see if i can adapt to it those color convertion routines 8) 8) 8)
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

jj2007

• Member
• Posts: 10556
• Assembler is fun ;-)
Re: Fast Compare Real8 with SSE
« Reply #24 on: February 02, 2019, 01:39:46 AM »
Thanks, Guga. There seems to be a little bug in your algo, but right now I have no time to chase it. But I attach a version that has both algos implemented. You just need to change the string on top:

BinSearch\$ equ BinSearchJ       ; use BinSearchG to test Guga's version

My version is not foolproof yet, either: search the source for seed, there is one that produces in pos 5/6 two very close numbers. Since the algo checks only the high DWORD of the REAL8, the comparison depends on a handful of bits of the mantissa only. That works most of the time, and it works always if the table is a known one, but for a random table it may create mismatches. It may not even be a problem, though. I would have to see concrete applications of such an algo. It is indeed fast, maybe that's the only thing that counts.

guga

• Member
• Posts: 1287
• Assembly is a state of art.
Re: Fast Compare Real8 with SSE
« Reply #25 on: February 02, 2019, 03:52:37 AM »
Tks JJ.

The bugs in mine version are probably because i forgot to include a routine to check for  numbers outside the range of the table as you did with:
Code: [Select]
`  cmp ebx, [esi] ; <left?  js @err  cmp ebx, [edx] ; >right?  jg @err`
But, i was amazed it was also fast (despite the errors).

Quote
My version is not foolproof yet, either: search the source for seed, there is one that produces in pos 5/6 two very close numbers.
Maybe another final check for the Loword of the HIDWORD should make the trick to fix. It would work as the original verson but immediatelly after

Code: [Select]
`  sub edx, pTable ; middle pos - original left  sar edx, 3  xchg eax, edx`
So, it could compare the index value found at eax, with the LOWORD of the HIDWORD of the next value (or previous one). If it also matches, then we have found the proper values.

Quote
I would have to see concrete applications of such an algo. It is indeed fast, maybe that's the only thing that counts.

Well...i can think on a "few" pratical applications where this couldd be used :icon_mrgreen: :icon_mrgreen: :icon_mrgreen:  A faster binary search (for Real8 on this particular case and this situations of scanning on a ordered list) can be of extreme use for video, image and audio processing.  On the functions i´m creating for image processing, such algorithm is a must, speciaqlly because colorpaces functions makes heavy usage of complex mathematical equations that could be easily replaced by simple pointers to tables  and here iss where a faster binary search can be used.

For example, on my original version of CIELCHtoRGB, the part that actually converts the Value (In KFactor) to Red, Green or Blue colors are given by the following formula:

FinalColor = ((TempColor+0.055)/1.055)^2.4

The final color is computed also after checking for the threshold as i mentioned earlier

KFactor = [ColorNormalized+Offset)/(1+Offset))^Gamma]  * 100 ; If ColorNormalized (Color/255) is bigger then Treshold
KFactor = (ColorNormalized/Slope)  * 100 ; If ColorNormalized (Color/255) is smaller or equal to Treshold

On my original version i used the same way people often uses to calculate this stuff. So i inserted the formula on the final part of the color convertions routines such as:

Code: [Select]
`Proc CieLCHtoRGB:    Arguments @pLuminance, @pChroma, @pHue, @Red, @Green, @Blue, @Flag, @WhiteRef    Structure @TempStorage 64, @pXDis 0, @pYDis 8, @pZDis 16, @TmpRedDis, 24, @TmpGreenDis 32, @TmpBlueDis 40, @pAFactorDis 48, @pBFactorDis 56    Uses esi, ebx, ecx, edx, edi    finit(.....)    lea ecx D@TmpRedDis | fld R@pXDis | fmul R\$esi+FloatMatrices.M1Dis | fld R@pYDis | fmul R\$esi+FloatMatrices.M2Dis | faddp ST1 ST0 | fld R@pZDis | fmul R\$esi+FloatMatrices.M3Dis | faddp ST1 ST0 | fstp R\$ecx    lea ecx D@TmpGreenDis | fld R@pXDis | fmul R\$esi+FloatMatrices.M4Dis | fld R@pYDis | fmul R\$esi+FloatMatrices.M5Dis | faddp ST1 ST0 | fld R@pZDis | fmul R\$esi+FloatMatrices.M6Dis | faddp ST1 ST0 | fstp R\$ecx    lea ecx D@TmpBlueDis | fld R@pXDis | fmul R\$esi+FloatMatrices.M7Dis | fld R@pYDis | fmul R\$esi+FloatMatrices.M8Dis | faddp ST1 ST0 | fld R@pZDis | fmul R\$esi+FloatMatrices.M9Dis | faddp ST1 ST0 | fstp R\$ecx    call GammaLinearDecodingEx F_gamma, F_Offset, F_Slope, F_Treshold, F_OffsetPlusOne, D@Flag    lea ecx D@TmpRedDis    .Fpu_If R\$ecx > R\$F_Treshold        ;Color = ((Color+0.055)/1.055)^2.4        fld R\$F_gamma | fld R\$ecx | fyl2x | fld1 | fld ST1 | fprem | f2xm1 | faddp ST1 ST0 | fscale | fxch | fstp ST0        fmul R\$F_OffsetPlusOne | fsub R\$F_Offset    .Fpu_Else        fld R\$ecx | fmul R\$F_Slope    .Fpu_End_If    mov ecx D@Red | fmul R\$Float255 | fistp F\$ecx    lea ecx D@TmpGreenDis    .Fpu_If R\$ecx > R\$F_Treshold        ;Color = ((Color+0.055)/1.055)^2.4        fld R\$F_gamma | fld R\$ecx | fyl2x | fld1 | fld ST1 | fprem | f2xm1 | faddp ST1 ST0 | fscale | fxch | fstp ST0        fmul R\$F_OffsetPlusOne | fsub R\$F_Offset    .Fpu_Else        fld R\$ecx | fmul R\$F_Slope    .Fpu_End_If    mov ecx D@Green | fmul R\$Float255 | fistp F\$ecx    lea ecx D@TmpBlueDis    .Fpu_If R\$ecx > R\$F_Treshold        ;Color = ((Color+0.055)/1.055)^2.4        fld R\$F_gamma | fld R\$ecx | fyl2x | fld1 | fld ST1 | fprem | f2xm1 | faddp ST1 ST0 | fscale | fxch | fstp ST0        fmul R\$F_OffsetPlusOne | fsub R\$F_Offset    .Fpu_Else        fld R\$ecx | fmul R\$F_Slope    .Fpu_End_If    mov ecx D@Blue | fmul R\$Float255 | fistp F\$ecx    ; clip values outside the range    mov ecx D@Red    If D\$ecx <s 0        mov D\$ecx 0    Else_If D\$ecx > 255        mov D\$ecx 255    End_If    mov ecx D@Green    If D\$ecx <s 0        mov D\$ecx 0    Else_If D\$ecx > 255        mov D\$ecx 255    End_If    mov ecx D@Blue    If D\$ecx <s 0        mov D\$ecx 0    Else_If D\$ecx > 255        mov D\$ecx 255    End_IfEndP`
See the problem ? To achieve the final color we need to do a bunch of mathematical operations using power functions to retrieve the proper value.

What i did was revert the formula and use them as a table that is pre-calculated way before CieLCHtoRGB is actually running. I found ouyt that the resultant values that generates the color have  fixed fractions (i named them as Kfactor). And the total amount of fractions are only 256 ! So, one fraction per color.

All i had to do is create a routine to calculate all those values and insert them on a huge structure, which i named as "WSMatrix" (So far, the size of this structure is 4388 bytes, but i´ll probably increase it because i need 2 or 3 more tables). I created a function called SetupWorkSpaceMatrixDataEx that can be used when the application starts, for example or under user choice only once.

The function SetupWorkSpaceMatrixDataEx pre-calculate literraly everything needed to convert from RGB to CieLCH and vice-versa, inserting all data (gamma, offset, white references, creating new matrices, calculating multiplicand fractions, stablishing limits etc etc) and put all of that on a single structure WSMatrix to be used internally.

The usage of a table and the pre-calculation of all those values is a must, specially because whenever i´m analysing a image/video etc, i no longer need to make all those computations for every pixel. All it is needed is for CieLCHtoRGB convertion is basically points to the precalculated data from WSMatrix and do simple math operations (if needed).

It´s a major advantage in terms of speed computation because we are no longer calculating everything for each pixel. For example, all those monster computationss i replaced with a simple :

Code: [Select]
`Proc CieLCHtoRGB_Ex:    Arguments @pLuminance, @pChroma, @pHue, @Red, @Green, @Blue, @pMatrix    finit    mov edx D@pChroma    lea edi D@pAFactorDis | mov esi D@pHue | fld R\$esi | fmul R\$Degree_Radian | fcos | fmul R\$edx | fstp R\$edi    lea edi D@pBFactorDis | mov esi D@pHue | fld R\$esi | fmul R\$Degree_Radian | fsin | fmul R\$edx | fstp R\$edi(...)    lea ecx D@TmpRedDis | fld R@pXDis | fmul R\$esi+WS_Matrix.Inverted.Red_M1Dis | fld R@pYDis | fmul R\$esi+WS_Matrix.Inverted.Green_M2Dis | faddp ST1 ST0 | fld R@pZDis | fmul R\$esi+WS_Matrix.Inverted.Blue_M3Dis | faddp ST1 ST0 | fstp R\$ecx    lea ecx D@TmpGreenDis | fld R@pXDis | fmul R\$esi+WS_Matrix.Inverted.Red_M4Dis | fld R@pYDis | fmul R\$esi+WS_Matrix.Inverted.Green_M5Dis | faddp ST1 ST0 | fld R@pZDis | fmul R\$esi+WS_Matrix.Inverted.Blue_M6Dis | faddp ST1 ST0 | fstp R\$ecx    lea ecx D@TmpBlueDis | fld R@pXDis | fmul R\$esi+WS_Matrix.Inverted.Red_M7Dis | fld R@pYDis | fmul R\$esi+WS_Matrix.Inverted.Green_M8Dis | faddp ST1 ST0 | fld R@pZDis | fmul R\$esi+WS_Matrix.Inverted.Blue_M9Dis | faddp ST1 ST0 | fstp R\$ecx    lea ebx D\$esi+WS_Matrix.KFactorMapDis    mov edi D@Red    fld R@TmpRedDis | fmul R\$Float100 | fstp R@TmpColorCheckDis    lea eax D@TmpColorCheckDis    call BinarySearch ebx, D\$eax+4, 256    mov D\$edi eax    mov edi D@Green    fld R@TmpGreenDis | fmul R\$Float100 | fstp R@TmpColorCheckDis    lea eax D@TmpColorCheckDis    call BinarySearch ebx, D\$eax+4, 256    mov D\$edi eax    mov edi D@Blue    fld R@TmpBlueDis | fmul R\$Float100 | fstp R@TmpColorCheckDis    lea eax D@TmpColorCheckDis    call BinarySearch ebx, D\$eax+4, 256    mov D\$edi eaxEndP`
I can also gain a bit more speed using the BinarySearch function inline. But i choose to use as a call to a function to see first, if it was working  (and it is  )

Didn´t even need to benchmark, since the difference in speed when using binarysearch rather then calculating "Color = ((Color+0.055)/1.055)^2.4" all the time, is noticed on naked eyes :icon_mrgreen: :icon_mrgreen: :icon_mrgreen:

If you have time and can fix the binarysearch for foolproof it will be very handy.

Also, i´ll try to see a way to optimize the sin, cosinee and atan2 computations as well. Since both convertions RGB to CieLCH and it´s reversal) makes heavy usage of trigonometry operations like those, an optimizations is also needed to speed things even more.

Just to you have a small idea on how the optimization can make things better. When i started all of those convertion routines, the images i´m posting here to transfom Gray to Color took something around 40 minutes to finish processing. After using the tables technique, the convertion was done in 4 to 8 seconds (Including the time to pre-calculate all of this). Now...with your´s optimization and the fixes i made on the convertion routines, it is taking something around 2 to 4 seconds including the precalculation routines to colorize a 640*480 grayscale image using a 960*720 reference color image.

This is to say that with further optimizations, i can reduce this total time of less then 1 second including the precalculations. So, since the precalculations are made when the app starts for example, it means that the total amount of time of coloring a image can be done in a few miliseconds.

Also, once i suceed to make the colorization method output a file formed by a kind of sample structure, then all of this can be done in almost notime. All is needed is retrieve the pointers from an external file, make the necessary adjustements to those pointers, and voialáa, you can convert a grayscale image onto a colored one almost immediatelly.

If i can be able to do this with image, the same process can be applied to video processing. For a video with let´s say 50.000 frames, if to colorize each frame it takes, let´s say 100 miliseconds per frame, it means that the video will be totally colorized in something around 15 minutes rather then years since the original version that took around 45 minutes to colorize 1 single image
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

jj2007

• Member
• Posts: 10556
• Assembler is fun ;-)
Re: Fast Compare Real8 with SSE
« Reply #26 on: February 02, 2019, 04:26:11 AM »
The bugs in mine version are probably because i forgot to include a routine to check for  numbers outside the range of the table
No, that's not a problem since we only feed valid numbers. The check can be omitted.

Quote
So, it could compare the index value found at eax, with the LOWORD of the HIDWORD of the next value (or previous one). If it also matches, then we have found the proper values.
Yes, something like that. But in most cases it's not necessary, since the numbers will be "distant" enough anyway. We are checking the high DWORD, i.e. the one that contains the exponent and part of the mantissa:

That "part" is 32-11-1=20 bits, or about 6-7 digits. Should be sufficient in most cases ;-)

daydreamer

• Member
• Posts: 1363
• building nextdoor
Re: Fast Compare Real8 with SSE
« Reply #27 on: February 03, 2019, 02:54:13 AM »
good work
I think there is a fast chebyshev on the board and maybe way of unroll to search r,g,b at once?
Quote from Flashdance
Nick  :  When you give up your dream, you die
*wears a flameproof asbestos suit*
Gone serverside programming p:  :D
I love assembly,because its legal to write
princess:lea eax,luke
:)

guga

• Member
• Posts: 1287
• Assembly is a state of art.
Re: Fast Compare Real8 with SSE
« Reply #28 on: February 03, 2019, 04:16:32 AM »
Tks Daydreamer. :t :t

I´ll take a look at the chebyshev on the board and see if the adaptation to the functions works ok.

About a way to unroll to search for rgb at once...well..it could be done but there would lead to some problems  and one of them is a conceptual one.

1s) Creating such a method would requires at least one huge table. Each table should have something around 398 Mb. Since the range of RGB is 0FFFFFF we will searching for 16581375 pixels. So, 16581375*8 (size of each Real8) * 3 (3 elements, one for Luminance, other for Chroma and other for hue) we will end up on a table with almost 398 Mb to search within. If we need 2 more tables it will something around 800 Mb and so on.

2nd) The conceptual problem is, actually, the harder one. All those formulas uses fractions to work with and exponential/trigonometry/power math functions to compute the values of each element (Luminance, hue, chromas, etc). The main problem is that,from all those 16 millions colors a huge amount of them are indistinguishable between each other. For example, RGB = 199, 255, 251 produces practically the same result (in chroma/hue/luminance) as 198, 254, 250 or 197, 255, 251 or 196, 255, 251. The human eye can´t identify which color is which. So, on a conceptual point of view all of those values represents the exact same color. So it would be hard to choose which color combination to use to convert back (From CieLCh/CieLab/XYZ to RGB) or simply by making the pointer to a table which will then points to another location.

What i found out on those perpcetual colorspaces is that they all claims that Luminance is completelly isolated from Chroma or Hue but, in fact, on their formulas Luma influences chroma and hue (although on a minor percentage). I also found a meaning of what is actually a gray color. Analysing the behaviour of luminance i found out that what we call gray is nothing more then a range of luminance spread in 256 pieces.

So, what will determine which gray color a RGB combination will have is basically the luminance.

So, for example, no matter what are the RGB combination used or what is the chroma or hue etc, whenever the luminance have a range in between
0 to a value smaller then 2.741066938704112e-1 it always will return in 0 as gray.

Therefore since gray represents the amount of luminance of a pixel, it also means that colors (on their isolated channels) are nothing more then the variation of luminance and thus, gray range. Thinking on that i came up with a table to be used to convert from luma to gray (according to the chosen model: Adfobe RGB 1998, sRGB, HDTV etc etc). It is something like this:

Gray  Luminosity (Min)
0       0
1       2.741066938704112e-1
2       5.48213387740825841e-1
3       8.22320081611237263e-1
4       1.0964267754816519
5       1.37053346935206344
6       1.64464016322247808
7       1.91874685709288939
(...)
253     99.309586872082832
254     99.6549222327689391
255     100

So, the range of gray can be interpreted as:

Color   Luminosity range
0       0 to < 2.741066938704112e-1
1       >= 2.741066938704112e-1 and < 5.48213387740825841e-1
2       >= 5.48213387740825841e-1 and < 8.22320081611237263e-1
...
254     >= 99.6549222327689391 and < 100
255     = 100

This lead me to another problem using the "common" formulas for RGBtoCieLab/RGBtoLCH. Assuming that we are working with tiny ranges of gray, i tried to make the RGBtoCieLab/RGBtoLCH function fixes the range to it´s minimum. So, for example, if a color combination results on a luminance of 6.7e-1, it means that this color luminance is related to the color 2 (Gray = 2), thus i´ll need to decrease the luminance value to it fit´s the minimum of 5.48213387740825841e-1.

But, then a problem came up. When i do this, the resultant values are converted accordly, so it reduces the chroma, but....since the formula also relates Y (From XYZ) to luminance and Y is directly related to the R,G,B pixels, if i reduced the value of luminance i´ll need at end to proportionally reduce the value of R,G,B individually.

The main problem is that when i do the convertion backwards, it will also reduce the value to a new RGB range. For example, inputing RGB = 199, 255, 251, it is in fact this combination: RGB = 198, 254, 250 (if the reduction is proportional) or RGB = 196, 255, 251 (if the reduction is biased on Chroma only).

But...when i convert back those reducted values 198, 254, 250 or 196, 255, 251 they will keep the proportion and reduce again decreasing each pixel value by 1.

This is because the current formulas used to compute the perceptual colorspaces works with Fractions to compute pixels whose values are integer (There´s no such a thing as a red = 154.56564...The value is simply 154 (or 155 according to  the used rounding mode).

To fix that i´ll have to find the limits in between Chroma, Hue and luminance, regardless their claims (false ones) that they are completely separated things.

The correct equation to compute all of that crap is this something like this.

Z = (Chroma*sin(hue)/200) + (Luma+16)/116

Where Z is the Z element of the XYZ colorspace.

But...Z is directly related to R,G,B pixels after multiplying by a matrix, in the form of: Red*Matrix7+Green*Matrix8+Blue*Matrix9

If i reduce proportionally, i´ll necessarily assume that Red, Green and Blue must be proportionally reduced as well....Or one of them will be readjusted to fits to the Z value. In any case, the resultant chroma and hue will be altered on a way that the backwards computation will be impossible because one of the channels will always be reduced on each time we use the function, due to the problem of using integers to compute fraction-ed elements and trying to do the backwards computation where there no such a thing as a fraction-ed red, green or blue.

Not sure if i´m being clear about all of this. The bottom line is that those convertion algorithms suchs  :icon_mrgreen: :icon_mrgreen: :icon_mrgreen: :icon_mrgreen:

I´m pretty sure that i´ll need to find some limits to work on those things and end to build 2 or 3 more tables. The problem is find them for each color range. :redface: :redface:
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

daydreamer

• Member
• Posts: 1363
• building nextdoor
Re: Fast Compare Real8 with SSE
« Reply #29 on: February 03, 2019, 05:06:55 AM »
sounds very complicated

question,if I want to tween between two pixels with colorspace HSV or CieLCH ,or any colorspace with Hue,maybe I need some IF to check direction,so I dont end up with hue pixel1=5 degrees,hue pixel2=350 degrees,tweens between 5 and 350 degrees,when it should tween for example 5degrees,2.5,0,357.5,355,352.5,350 instead?
Quote from Flashdance
Nick  :  When you give up your dream, you die
*wears a flameproof asbestos suit*
Gone serverside programming p:  :D
I love assembly,because its legal to write
princess:lea eax,luke
:)