The MASM Forum

General => The Laboratory => Topic started by: guga on January 29, 2019, 02:31:06 PM

Title: Fast Compare Real8 with SSE and ColorSpaces
Post by: guga on January 29, 2019, 02:31:06 PM
Hi Guys

Somebody suceeded to create a faster way to find a Real8 value among a table of 256 ?

The problems is like thiss:

given a table of 256 Real8 values from 0 to 100 (And all in crescent order from 0 to 100 only)

Btw: R$ stands for Real8 notation ;)

[MyTable:
 Data0: R$  0 ; Index = 0
 Data1: R$ 4.89e-2 ; Index = 1
 Data2: R$ 9.79e-2 ; Index = 2
 Data3: R$ 1.46e-1 ; Index = 3
 Data4: R$ 1.95e-1 ; Index = 4
(...)
 Data78: R$ 9.44
 Data79: R$ 9.67
(...)
 Data150: R$ 33.6 ; Index = 150
 Data151: R$ 34.13 ; Index = 151
(...)
 Data254: R$ 99.18 ; Index = 254
 Data255: R$ 100] ; Index = 255

MyValue = R$ 78.5212


The goal is to find the value "78.5212" among the 256 Real8 Array (In crescent order) and returns the corresponding index of it.

I made a routine for that, but it is kinda slow, though :redface: :redface: :redface: Not sure how to use SSE to make it faster

Code: [Select]

    ; ebx points to the end of the array. So points to "Data255"

    Fpu_If R@MyValue >= R$ebx
        mov eax 255 | jmp L1> ; Found value on the last array (the 255th one). Set it´s index to eax
    Fpu_End_If

    mov ecx 254
    Do

        Fpu_If_And R@MyValue < R$ebx, R@MyValue >= R$ebx-8   ;If value was found in between the current and the previous, we have a match
            mov eax ecx | jmp L1> ; Found value and set it´s index to eax
        Fpu_End_If
        sub ebx 8
        dec ecx
    Loop_Until ecx <= 0
L1:

Title: Re: Fast Compare Real8 with SSE
Post by: jj2007 on January 29, 2019, 03:24:21 PM
Option 0 below is probably the fastest solution (project attached):

include \masm32\MasmBasic\MasmBasic.inc         ; download (http://masm32.com/board/index.php?topic=94.0)

table REAL8 1234567890.1, 1234567890.2, 1234567890.3, 1234567890.4, 1234567890.5, 1234567890.6

  Init
  fld FP8(1234567890.5)   ; we want to see Match #5
  sub esp, REAL8
  fstp REAL8 ptr [esp]
  movlps xmm1, REAL8 ptr [esp]
  mov esi, offset table
  xor ecx, ecx
  if 1  ; 1=use Fcmp (http://www.webalice.it/jj2006/MasmBasicQuickReference.htm#Mb1201), 0=use pcmpeqb & friends
       .Repeat
                movlps xmm0, [esi+REAL8*ecx]
                inc ecx
                Fcmp xmm0, xmm1, high   ; high = be reasonably strict; try top or medium
       .Until Zero? || ecx>5
  else
        .Repeat
                movlps xmm0, [esi+REAL8*ecx]
                pcmpeqb xmm0, xmm1
                pmovmskb eax, xmm0
                inc ecx
        .Until ecx>5 || ax==0FFFFh
  endif
  .if Zero?
        Inkey Str$("Match at index %i", ecx)
  .else
        Inkey "No match, sorry"
  .endif
EndOfCode


Option 1, Fcmp, is a bit slower, but the problem here is that the fast solution requires an exact match. Real world examples rarely have such exact matches. What do you need it for?
Title: Re: Fast Compare Real8 with SSE
Post by: guga on January 29, 2019, 04:26:57 PM
Tks a lot, Jochen :t :t :t

I´ll give it a try It´s for a two functions i´m making to convert from RGB to CieLCH (and vice-versa). I´m also studying how you managed to make a faster sin and cosine function. So far, the functions i built are fast for image processing, but still not fast enough for video. With the help of those optimizations i hope the functions can be improved. If i succeed to make it work as expected i´ll build a library and post it here for you :)

I was a bit surprised about the power of the CieLCH conversions and it´s precision and started to give a try on then :) I was amazed whern found out after doing the maths envolved with those colorspaces that, in fact we can get rid of tons of useless colors combinations that are, in fact, indistinguishable from each other. I already succeeded to reduce the luminance values to fit´s to the range of 0 to 255 and it seems that for chroma and hue it has some restrictions as well. It seems that chrom and hue are related to each other on every luminance value from a table. So it would be much easier to perform the conversions and, at the same time, make the general aspect of a image (or video) be more consistent to what the CieLCH colorspace is actually doing.
Title: Re: Fast Compare Real8 with SSE
Post by: guga on January 29, 2019, 11:05:18 PM
Hi JJ

I tried the pcmpeqb xmm0, xmm1 but it´s not working. It´s oly exiting the loop when both values are exactly the  same.

The goal is to exit the loop when the value in MyTable is bigger or equal to the inputed value "1234567890.5"

For instance, if the difference between them are more then 8 digits, the routine dn´t exit the loop and don´t find the number.

I also tried with pcmpgtd/movlpd but no sucess. I did this:

Code: [Select]
   mov ebx MYTable

    movlpd xmm1 X@ValuetoFind
    xor ecx ecx
    Do
        movlpd XMM0 X$ebx+ecx*8
        pcmpgtd XMM0 XMM1
        pmovmskb eax XMM0
        If eax <> 0
            dec ecx | jmp L1>
        End_If
        inc ecx
    Loop_Until ecx > 255

Ex:
[MyTable: R$ 0 ; index1
2 ; index2
4 ; index3
6 ; index 4
8] ; index 5

MyNumber = 3

Then it will exit the loop when the value was found  in Between 2 and 4. In, short, when it reaches "4" it will exit the loop, since 4 is bigger then 3 (Also needs to work for equal values as well and not only bigger then)

A question, why comparing also ax with -1 ?
Title: Re: Fast Compare Real8 with SSE
Post by: hutch-- on January 30, 2019, 01:44:10 AM
Guga,

See if "comisd" is more suited to the comparison you need.
Title: Re: Fast Compare Real8 with SSE
Post by: daydreamer on January 30, 2019, 03:18:42 AM
Hutch is right comissd for conditional jump
Organize in Search tree maybe faster?
Title: Re: Fast Compare Real8 with SSE
Post by: jj2007 on January 30, 2019, 03:37:13 AM
COMISD works but it is exactly as fast as the pcmpeqb/pmovmskb combi. Besides, the problem remains that equality means for both variants 53 identical bits in the mantissa. If the matches are guaranteed to be exactly equal at REAL8 level, fine, otherwise you need to allow for a delta as in Fcmp (http://www.webalice.it/jj2006/MasmBasicQuickReference.htm#Mb1201).

Below results for a 50 Million elements array of random doubles in the range -10.0 ... +10.0
The first run has an exact match at the end of the array.
For the second run, the search value was increased by a tiny amount.

The issue is tricky and requires a good task definition. My impression is that working with integers would be a wiser strategy.

Code: [Select]
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz
last elements of double array:
49999999        -1.09845799859612
50000000        -6.69932428747415
50000001        0.0

Now trying to find a match for -6.699324287474150097
Finding match at pos 50000000 (-6.699324287474150097) took 825 ms using Fcmp
Finding match at pos 50000000 (-6.699324287474150097) took 70 ms using pcmpeqb
Finding match at pos 50000000 (-6.699324287474150097) took 70 ms using comisd

Now trying to find a match for -6.698733009397984439
Finding match at pos 20284227 (-6.698732967488462364) took 346 ms using Fcmp
No match using pcmpeqb, sorry
No match using comisd, sorry
Title: Re: Fast Compare Real8 with SSE
Post by: AW on January 30, 2019, 04:22:17 AM
I don't think SSE will be advantageous, what we need is search using Binary Search. It will get there in a maximum of 9 iterations.

Code: [Select]
H:\temp>binsearch
Searching for 937.5896481215857 (position 237)
Found 937.5896481215857 in position 237 in 7 iterations

H:\temp>binsearch
Searching for 477.70622882778406 (position 114)
Found 477.70622882778406 in position 114 in 8 iterations

H:\temp>binsearch
Searching for 945.06668294320502 (position 246)
Found 945.06668294320502 in position 246 in 8 iterations

H:\temp>binsearch
Searching for 532.82265694143496 (position 122)
Found 532.82265694143496 in position 122 in 8 iterations

H:\temp>binsearch
Searching for 995.36118655964844 (position 255)
Found 995.36118655964844 in position 255 in 9 iterations

H:\temp>binsearch
Searching for 552.75124362926124 (position 131)
Found 552.75124362926124 in position 131 in 6 iterations

H:\temp>binsearch
Searching for 526.77999206518757 (position 140)
Found 526.77999206518757 in position 140 in 8 iterations

H:\temp>binsearch
Searching for 98.544267097994933 (position 16)
Found 98.544267097994933 in position 16 in 8 iterations

H:\temp>binsearch
Searching for 617.23685415204318 (position 149)
Found 617.23685415204318 in position 149 in 7 iterations

H:\temp>binsearch
Searching for 205.78630939664907 (position 52)
Found 205.78630939664907 in position 52 in 8 iterations

H:\temp>binsearch
Searching for 672.10913418988616 (position 184)
Found 672.10913418988616 in position 184 in 8 iterations

H:\temp>binsearch
Searching for 250.09918515579699 (position 60)
Found 250.09918515579699 in position 60 in 8 iterations

H:\temp>binsearch
Searching for 754.02081362346257 (position 193)
Found 754.02081362346257 in position 193 in 7 iterations

H:\temp>binsearch
Searching for 277.16910306100652 (position 69)
Found 277.16910306100652 in position 69 in 7 iterations

H:\temp>binsearch
Searching for 787.62169255653555 (position 202)
Found 787.62169255653555 in position 202 in 8 iterations

H:\temp>binsearch
Searching for 320.99368266853844 (position 78)
Found 320.99368266853844 in position 78 in 8 iterations

H:\temp>binsearch
Searching for 846.88863795892212 (position 210)
Found 846.88863795892212 in position 210 in 8 iterations

H:\temp>binsearch
Searching for 350.0167851802118 (position 87)
Found 350.0167851802118 in position 87 in 5 iterations

The attachment contains source in C and .exe . It is easy to convert to ASM.
Title: Re: Fast Compare Real8 with SSE
Post by: guga on January 30, 2019, 04:26:30 AM
Many Tks, Steve and JJ :)

Steve, i gave a try on comisd and comis but, it seems that they are slower then the regular fcomip accordlying to https://stackoverflow.com/questions/37766131/intel-x86-64-assembly-compare-signed-double-precision-floats.
So i did a variation of the macro i use for Fpu Comparitions, and this worked but, i´m not sure yet about speed yet :icon_rolleyes:

Code: [Select]
(...)
    lea ecx D@TmpRedDis | fld R@pXDis | fmul R$esi+WS_Matrix.Inverted.Red_M1Dis | fld R@pYDis | fmul R$esi+WS_Matrix.Inverted.Green_M2Dis | faddp ST1 ST0 | fld R@pZDis | fmul R$esi+WS_Matrix.Inverted.Blue_M3Dis | faddp ST1 ST0 | fstp R$ecx
    lea ecx D@TmpGreenDis | fld R@pXDis | fmul R$esi+WS_Matrix.Inverted.Red_M4Dis | fld R@pYDis | fmul R$esi+WS_Matrix.Inverted.Green_M5Dis | faddp ST1 ST0 | fld R@pZDis | fmul R$esi+WS_Matrix.Inverted.Blue_M6Dis | faddp ST1 ST0 | fstp R$ecx
    lea ecx D@TmpBlueDis | fld R@pXDis | fmul R$esi+WS_Matrix.Inverted.Red_M7Dis | fld R@pYDis | fmul R$esi+WS_Matrix.Inverted.Green_M8Dis | faddp ST1 ST0 | fld R@pZDis | fmul R$esi+WS_Matrix.Inverted.Blue_M9Dis | faddp ST1 ST0 | fstp R$ecx

    ; After computing the values to be found, let´s scan on our look up table

    lea ebx D$esi+WS_Matrix.KFactorMapDis ; Points to the look up table
    mov edi D@Red ; Output result in "Red" Variable
    fld R@TmpRedDis | fmul R$Float100 | fstp R@TmpRedDis

    fld R@TmpRedDis
    xor ecx ecx
    Do
        fld R$ebx+ecx*8 | fcomip ST0 ST1 | jna L0>
            ffree ST0
            dec ecx | jmp L1>
        L0:
        inc ecx
    Loop_Until ecx > 255
    dec ecx
L1:
    mov D$edi ecx


JJ, i can´t work with integer, unfortunatelly. Not on this part of the code. What returns in "R@TmpRedDis" is, in fact, the result of a computation of the red pixel, so it is a fraction from 0 to 100. Each value on the table i´m using contains the result of "TmpRedDis" whose value is computed on a power basics. So, TempRed = XXX*YYY +1/(z^C) etc etc. The good thing is that the table contains 256 Real8 which values are ordered from 0 to 100 (in real8)...so, 0.04456456, 0.689798, 1.23258., 1.989, ....100 Which makes the searching  faster to scan, i presume.

I´m trying to replace the function to findd the proper value of TempRed with a simple scan to it´s look up table rather then using the regular function to compute it  directly.

The direct function uses a power of 2.4 to retrieve the correct (final) value of RedPixel (originated from TmpRedDis).

The "direct" function is like this:

Code: [Select]
(...)

    lea ecx D@TmpRedDis | fld R@pXDis | fmul R$esi+WS_Matrix.Inverted.Red_M1Dis | fld R@pYDis | fmul R$esi+WS_Matrix.Inverted.Green_M2Dis | faddp ST1 ST0 | fld R@pZDis | fmul R$esi+WS_Matrix.Inverted.Blue_M3Dis | faddp ST1 ST0 | fstp R$ecx
    lea ecx D@TmpGreenDis | fld R@pXDis | fmul R$esi+WS_Matrix.Inverted.Red_M4Dis | fld R@pYDis | fmul R$esi+WS_Matrix.Inverted.Green_M5Dis | faddp ST1 ST0 | fld R@pZDis | fmul R$esi+WS_Matrix.Inverted.Blue_M6Dis | faddp ST1 ST0 | fstp R$ecx
    lea ecx D@TmpBlueDis | fld R@pXDis | fmul R$esi+WS_Matrix.Inverted.Red_M7Dis | fld R@pYDis | fmul R$esi+WS_Matrix.Inverted.Green_M8Dis | faddp ST1 ST0 | fld R@pZDis | fmul R$esi+WS_Matrix.Inverted.Blue_M9Dis | faddp ST1 ST0 | fstp R$ecx

    ; After computing the values to be found, let´s do the computations to find red, Green and Blue using "Power of" maths

    lea ecx D@TmpRedDis
    .Fpu_If R$ecx > R$esi+WS_Matrix.GammaDecode.TresholdDis
        ;Color = ((Color+0.055)/1.055)^2.4
        fld R$esi+WS_Matrix.GammaDecode.GammaDis | fld R$ecx | fyl2x | fld1 | fld ST1 | fprem | f2xm1 | faddp ST1 ST0 | fscale | fxch | fstp ST0
        fmul R$esi+WS_Matrix.GammaDecode.OffsetPlusDis | fsub R$esi+WS_Matrix.GammaDecode.OffsetDis
    .Fpu_Else
         ; Here is faster, but happens only when "index" (Pixel color) is (in general) smaller then 12.12354(or something..this value is not fixed, although it is small). So, the odds of a pixel falls on this jmp routine are 12/256.
        fld R$ecx | fmul R$esi+WS_Matrix.GammaDecode.SlopeDis
    .Fpu_End_If
    mov ecx D@Red | fmul R$Float255 | fistp F$ecx

See ? The function makes usage of logaritm FPU, scaling etc etc...All of this seems to be slower then a loop scan on the LookupTable, though.

Didn´t have time to test for speed yet, but it seems that a direct computation is slower then a simple scan on the  256 Real8 values.
Title: Re: Fast Compare Real8 with SSE
Post by: guga on January 30, 2019, 04:27:49 AM
Hi Aw, a binary search ? Hmm...indeed..I´l take a look. Many tks :)
Title: Re: Fast Compare Real8 with SSE
Post by: daydreamer on January 30, 2019, 05:41:26 AM
I am trying something similar, but enhance it with lookup one double and the following entry in the data and calculate mean value of two,or more advanced calculation based also on which number it is closer affect it more than the one that its farther to
 
Title: Re: Fast Compare Real8 with SSE
Post by: jj2007 on January 30, 2019, 06:44:28 AM
OK, now I see it's an entirely different story. It could be done with a simple cmp eax, [esi+8*ecx] - assuming that the low DWORD is unique in that table of 256.
Title: Re: Fast Compare Real8 with SSE
Post by: guga on January 30, 2019, 12:20:26 PM
Hi Jochen.

I´m not sure if the Low Dword values on the table are unique. I´ll take a look at them :t :t :t. But...even if they are unique, how a look up table containing only the low dwords  will work better ?

The main routine to calculate those values used in the KFactorMap are computed way before the main converter (RGBtoCieLCH and CieLCHtoRGB) starts. The values are calculated from the gamma and offset  related to each monitor/camera/colorspace etc (Adobe1998, Bruce, SMTP etc etc)

Code: [Select]
    ; and create our KFactor Map for each color. Put the resultant values on the main Structure to form the table

    xor ecx ecx
    Do
        call FindKFactor ecx, D@pOutStruct
        inc ecx
    Loop_Until ecx > 255

And the main routine is at:
Code: [Select]

; RGBtoCieLCH_Ex
Proc FindKFactor:
    Arguments @Color, @pMatrix
    Structure @TempStorage 64, @pKfactorColorDis 0, @pKFactorWhiteRefXDis 8, @pKFactorWhiteRefYDis 16, @pKFactorWhiteRefZDis 24, @TmpGreenDis 32, @TmpBlueDis 40, @pAFactorDis 48, @pBFactorDis 56
    Uses esi, ebx, ecx, edx, edi

    finit

    lea ecx D@pKfactorColorDis | fild F@Color | fmul R$FloatOne_255 | fstp R$ecx

    mov esi D@pMatrix

    ; Points to the Start of KfactorMap array
    mov ebx esi | add ebx WS_Matrix.KFactorMapDis ; Points to the start of the KFactor Map array in WS_Matrix Structure
    xor edx edx
    mov eax 8;Size_Of_KFactorMap
    mul D@Color
    add ebx eax


    ; newColorRed = ((R@Red+0.055)/1.055)^2.4 ; where2.4 = gamma, 0.055 = Offset, 1.055 = Offset+1. Better precalculating them as i did.
    lea ecx D@pKfactorColorDis
    .Fpu_If R$ecx > R$esi+WS_Matrix.GammaEncode.TresholdDis
        fld R$esi+WS_Matrix.GammaEncode.GammaDis | fld R$esi+WS_Matrix.GammaEncode.OffsetDis | fadd R$ecx | fmul R$esi+WS_Matrix.GammaEncode.OffsetPlusDis | fyl2x | fld1 | fld ST1 | fprem | f2xm1 | faddp ST1 ST0 | fscale | fxch | fstp ST0
    .Fpu_Else
        fld R$ecx | fdiv R$esi+WS_Matrix.GammaEncode.SlopeDis
    .Fpu_End_If
    fmul R$Float100 | fstp R$ebx;+KFactorMap.ColorDis ; Original i wrote it as KFactorRed, KfactorGreen, KfactorBlue, but, in fact it is only 1 color/pixel to check. So... 256 in total

EndP


These functions are giving me a tremendous headache. I just found out that the threshold must be analysed (On the main convertiopn function CieLCHtoRGB and not the above one;) ) . When i convert colors such as Red = 12, Green = 14, Blue = 0, the resultant vaules after tyhe multiplication of the matrices are below the "WS_Matrix.GammaDecode.Treshold". So i´ll need to try to find a integer value and put that liek another members of the structure i´m using to create the table. Once i found if the threshold can be converted to a integer, it would be eaiser to make the check doing the convertion back after the matrix multiplication.

Something like:

Code: [Select]
mov eax D$esi+ThresholdInteger
fld R@TmpRedDis | fmul R$esi+WS_Matrix.GammaDecode.SlopeDis | fmul R$Float255 | fistp D@TempColorCheck
If D@TempColorCheck < eax
      return D@TempColorCheck ; the color of this pixel was found under the treshold.
Else
     ; perform check on the loop up table (Binary search, or whatever other method)
EndIf


I hope i can find a way to create a integer related to this threshold. The only problem i see is that we will need extra coding (mul by slope and mul by 255) to pixels that are found in less then 13% of the cases.

A true pain make those functions works correctly and to get things worst, all the papers i read so far are utterly complicated and tends to put different equations to create the very same thing, rather then simply trying to merge those equations together and then try to simplify all of this.

At least i´ve got the function works (well..kind of :icon_mrgreen: :icon_mrgreen:) and the only thing that needs to be done is optimize all of this.
Title: Re: Fast Compare Real8 with SSE
Post by: AW on January 30, 2019, 08:31:57 PM
All right, finally a MASM solution in this thread (which is derived from my initial C solution  :lol:) adjusting for the requirement that the search value doesn't need to be exact.

So, I adjusted the original C algo like this:

Code: [Select]
int binarySearch(double sortedArray[], int first, int last, double key, int *iterations)
{

while (first <= last) {
int mid = first + (last-first) / 2;
(*iterations)++;
if (key == sortedArray[mid])
return mid;
else if (key < sortedArray[mid])
last = mid - 1;
else {
if (key < sortedArray[mid + 1])
return mid;
else
first = mid + 1;

}
}
return -1;
}

and can be translated into an ASM module like this, after deoptimizing as needed the compiler ASM output:
Code: [Select]
.model flat, C
option casemap :none

.code

binSearch proc uses esi edi ebx sortedArray:ptr, first:dword, last:dword, _key:real8, iteration:ptr
fld QWORD PTR _key
mov edi, last
mov esi, first
mov ebx, iteration

@startLoop:
inc dword ptr[ebx]

; if (key == sortedArray[mid])
fld ST(0) ;  duplicate stack
mov eax, edi
sub eax, esi
cdq
sub eax, edx
mov ecx, eax
sar ecx, 1
add ecx, esi
mov edx, sortedArray
fld QWORD PTR [edx+ecx*8]
fucompp
fnstsw ax
test ah, 68 ; 00000044H
jnp @exit
fcom QWORD PTR [edx+ecx*8]
fnstsw ax
test ah, 5
jp SHORT @checkElse

lea edi, DWORD PTR [ecx-1]
jmp SHORT @endLoop

@checkElse:
fcom QWORD PTR [edx+ecx*8+8]
fnstsw ax
test ah, 5
jnp SHORT @exit
lea esi, DWORD PTR [ecx+1]

@endLoop:
cmp esi, edi
jle SHORT @startLoop

@exit:
fstp ST(0)
mov eax, ecx

ret
binSearch endp

end

So, lets test 100 times:

Code: [Select]
Searching for 393.98684652241582 (position 104)
Found 392.98684652241582 in position 104 in 8 iterations
Searching for 1.488296151615955 (position 0)
Found 0.48829615161595508 in position 0 in 8 iterations
Searching for 343.86123233741267 (position 91)
Found 340.86123233741267 in position 91 in 6 iterations
Searching for 866.07766960661638 (position 218)
Found 865.07766960661638 in position 219 in 6 iterations
Searching for 768.01458784752947 (position 198)
Found 766.01458784752947 in position 198 in 8 iterations
Searching for 457.86919766838588 (position 125)
Found 457.86919766838588 in position 125 in 7 iterations
Searching for 388.74477370525221 (position 102)
Found 388.74477370525221 in position 102 in 8 iterations
Searching for 180.82216864528337 (position 51)
Found 180.82216864528337 in position 51 in 6 iterations
Searching for 427.22861415448472 (position 114)
Found 427.22861415448472 in position 114 in 8 iterations
Searching for 517.38947721793272 (position 137)
Found 514.38947721793272 in position 137 in 7 iterations
Searching for 480.00430310983609 (position 127)
Found 477.00430310983609 in position 129 in 7 iterations
Searching for 326.21442304757835 (position 82)
Found 322.21442304757835 in position 83 in 6 iterations
Searching for 645.95684682760088 (position 162)
Found 641.95684682760088 in position 164 in 8 iterations
Searching for 950.56920072023684 (position 239)
Found 947.56920072023684 in position 239 in 4 iterations
Searching for 825.06790368358406 (position 212)
Found 825.06790368358406 in position 212 in 8 iterations
Searching for 801.96746726889853 (position 206)
Found 797.96746726889853 in position 208 in 8 iterations
Searching for 598.49137852107299 (position 154)
Found 597.49137852107299 in position 155 in 6 iterations
Searching for 182.50276192510759 (position 49)
Found 178.50276192510759 in position 52 in 8 iterations
Searching for 910.81780449842836 (position 229)
Found 909.81780449842836 in position 229 in 7 iterations
Searching for 178.50276192510759 (position 49)
Found 178.50276192510759 in position 49 in 7 iterations
Searching for 193.48023926511428 (position 54)
Found 192.48023926511428 in position 54 in 8 iterations
Searching for 10.19525742362743 (position 2)
Found 6.1952574236274298 in position 3 in 6 iterations
Searching for 27.130954924161504 (position 7)
Found 27.130954924161504 in position 7 in 5 iterations
Searching for 441.32816553239542 (position 120)
Found 441.32816553239542 in position 120 in 8 iterations
Searching for 394.19336527603991 (position 103)
Found 392.19336527603991 in position 104 in 8 iterations
Searching for 791.80480361339153 (position 202)
Found 787.80480361339153 in position 203 in 6 iterations
Searching for 911.36002685628841 (position 228)
Found 909.36002685628841 in position 229 in 7 iterations
Searching for 581.5159764397107 (position 151)
Found 579.5159764397107 in position 151 in 5 iterations
Searching for 660.84780419324318 (position 173)
Found 660.84780419324318 in position 173 in 7 iterations
Searching for 145.35254982146674 (position 38)
Found 144.35254982146674 in position 38 in 8 iterations
Searching for 969.23734244819479 (position 248)
Found 969.23734244819479 in position 248 in 8 iterations
Searching for 961.31171605578777 (position 242)
Found 958.31171605578777 in position 243 in 6 iterations
Searching for 642.95684682760088 (position 162)
Found 641.95684682760088 in position 162 in 8 iterations
Searching for 392.74477370525221 (position 102)
Found 388.74477370525221 in position 103 in 5 iterations
Searching for 93.426343577379683 (position 21)
Found 90.426343577379683 in position 22 in 8 iterations
Searching for 909.81780449842836 (position 229)
Found 909.81780449842836 in position 229 in 7 iterations
Searching for 421.13409833063753 (position 111)
Found 418.13409833063753 in position 111 in 4 iterations
Searching for 54.386059144871361 (position 13)
Found 50.386059144871361 in position 13 in 7 iterations
Searching for 899.67143772698148 (position 225)
Found 897.67143772698148 in position 225 in 7 iterations
Searching for 222.16763817255165 (position 61)
Found 221.16763817255165 in position 61 in 7 iterations
Searching for 331.21738334299755 (position 84)
Found 331.21738334299755 in position 84 in 8 iterations
Searching for 335.21738334299755 (position 84)
Found 331.21738334299755 in position 87 in 5 iterations
Searching for 324.61827448347424 (position 83)
Found 323.61827448347424 in position 83 in 6 iterations
Searching for 716.20126956999422 (position 188)
Found 715.20126956999422 in position 190 in 8 iterations
Searching for 144.33121738334302 (position 37)
Found 141.33121738334302 in position 37 in 7 iterations
Searching for 660.14795373393963 (position 168)
Found 656.14795373393963 in position 171 in 6 iterations
Searching for 208.93890194402906 (position 57)
Found 205.93890194402906 in position 57 in 7 iterations
Searching for 593.89019440290531 (position 153)
Found 593.89019440290531 in position 153 in 7 iterations
Searching for 787.87502670369577 (position 200)
Found 784.87502670369577 in position 202 in 8 iterations
Searching for 147.24469740897854 (position 39)
Found 146.24469740897854 in position 39 in 5 iterations
Searching for 678.17520676290167 (position 181)
Found 677.17520676290167 in position 181 in 7 iterations
Searching for 998.74163029877616 (position 254)
Found 997.74163029877616 in position 255 in 9 iterations
Searching for 699.79149754325999 (position 184)
Found 695.79149754325999 in position 184 in 8 iterations
Searching for 150.24469740897854 (position 39)
Found 146.24469740897854 in position 40 in 8 iterations
Searching for 714.77401043733028 (position 187)
Found 714.77401043733028 in position 187 in 6 iterations
Searching for 516.38947721793272 (position 137)
Found 514.38947721793272 in position 137 in 7 iterations
Searching for 497.91866817224644 (position 132)
Found 494.91866817224644 in position 132 in 8 iterations
Searching for 663.00750755333104 (position 175)
Found 662.00750755333104 in position 175 in 4 iterations
Searching for 578.77642139957891 (position 149)
Found 577.77642139957891 in position 149 in 7 iterations
Searching for 793.84954374828328 (position 203)
Found 789.84954374828328 in position 204 in 8 iterations
Searching for 323.71190527054659 (position 80)
Found 319.71190527054659 in position 83 in 6 iterations
Searching for 979.27082125308993 (position 251)
Found 978.27082125308993 in position 251 in 6 iterations
Searching for 360.4256721701712 (position 95)
Found 356.4256721701712 in position 95 in 3 iterations
Searching for 791.95532090212714 (position 204)
Found 791.95532090212714 in position 204 in 8 iterations
Searching for 325.61827448347424 (position 83)
Found 323.61827448347424 in position 83 in 6 iterations
Searching for 997.99496444593649 (position 253)
Found 994.99496444593649 in position 254 in 8 iterations
Searching for 540.79717398602259 (position 139)
Found 537.79717398602259 in position 141 in 7 iterations
Searching for 501.13733329264198 (position 133)
Found 500.13733329264198 in position 134 in 8 iterations
Searching for 340.86330759605698 (position 88)
Found 336.86330759605698 in position 91 in 6 iterations
Searching for 839.77782525101475 (position 214)
Found 839.77782525101475 in position 214 in 8 iterations
Searching for 349.44611957152011 (position 94)
Found 346.44611957152011 in position 94 in 8 iterations
Searching for 383.97677541428874 (position 100)
Found 382.97677541428874 in position 100 in 8 iterations
Searching for 230.68230231635485 (position 62)
Found 229.68230231635485 in position 63 in 2 iterations
Searching for 362.13263344218268 (position 97)
Found 362.13263344218268 in position 97 in 7 iterations
Searching for 429.22861415448472 (position 114)
Found 427.22861415448472 in position 115 in 6 iterations
Searching for 791.84954374828328 (position 203)
Found 789.84954374828328 in position 203 in 6 iterations
Searching for 443.32816553239542 (position 120)
Found 441.32816553239542 in position 121 in 7 iterations
Searching for 661.09402752769552 (position 169)
Found 657.09402752769552 in position 173 in 7 iterations
Searching for 182.71639149143957 (position 50)
Found 178.71639149143957 in position 52 in 8 iterations
Searching for 563.93731498153636 (position 146)
Found 561.93731498153636 in position 146 in 8 iterations
Searching for 540.29969176305428 (position 141)
Found 540.29969176305428 in position 141 in 7 iterations
Searching for 420.13409833063753 (position 110)
Found 418.13409833063753 in position 111 in 4 iterations
Searching for 366.13263344218268 (position 97)
Found 362.13263344218268 in position 97 in 7 iterations
Searching for 483.0490432447279 (position 129)
Found 479.0490432447279 in position 130 in 8 iterations
Searching for 587.92486342967004 (position 152)
Found 585.92486342967004 in position 152 in 8 iterations
Searching for 649.0768456068605 (position 165)
Found 646.0768456068605 in position 165 in 7 iterations
Searching for 568.59953611865603 (position 147)
Found 565.59953611865603 in position 147 in 6 iterations
Searching for 842.77782525101475 (position 214)
Found 839.77782525101475 in position 214 in 8 iterations
Searching for 788.99710074159975 (position 201)
Found 784.99710074159975 in position 202 in 8 iterations
Searching for 970.03289895321507 (position 244)
Found 966.03289895321507 in position 249 in 7 iterations
Searching for 669.89046906949068 (position 177)
Found 666.89046906949068 in position 178 in 8 iterations
Searching for 764.82436597796561 (position 197)
Found 764.82436597796561 in position 197 in 7 iterations
Searching for 910.94906460768459 (position 227)
Found 906.94906460768459 in position 229 in 7 iterations
Searching for 216.10364085818048 (position 58)
Found 212.10364085818048 in position 60 in 8 iterations
Searching for 446.80727561265911 (position 122)
Found 444.80727561265911 in position 123 in 6 iterations
Searching for 964.17334513382366 (position 243)
Found 960.17334513382366 in position 243 in 6 iterations
Searching for 717.20126956999422 (position 188)
Found 715.20126956999422 in position 190 in 8 iterations
Searching for 11.169774468214973 (position 4)
Found 11.169774468214973 in position 4 in 8 iterations
Searching for 995.99496444593649 (position 253)
Found 994.99496444593649 in position 253 in 7 iterations
Searching for 976.47230445265052 (position 250)
Found 972.47230445265052 in position 250 in 8 iterations

I don't believe that a linear search outperforms a binary search even with only 256 elements, anyway that is a lot of fun too.  :biggrin:
Title: Re: Fast Compare Real8 with SSE
Post by: guga on January 30, 2019, 10:11:36 PM
Many Tks AW  :t :t :t. I´ll take a look :)
Title: Re: Fast Compare Real8 with SSE
Post by: jj2007 on January 30, 2019, 10:30:24 PM
given a table of 256 Real8 values from 0 to 100 (And all in crescent order from 0 to 100 only)

I've given it a try for the naive scan method:
Code: [Select]
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz
10000000 random searches performed in 1157 ms

The source:

include \masm32\MasmBasic\MasmBasic.inc         ; download (http://masm32.com/board/index.php?topic=94.0)
  Init
  Dim table() As REAL8          ; create an array
  Let edi=New$(256*DWORD)       ; and a DWORD table
  push edi                      ; we need it later
  Rand()
  For_ ecx=0 To 255
        Rand(0, 100, table(ecx))
  Next
  ArraySort table()             ; sort double array ascending
  Open "O", #1, "TheTable.dat"  ; save the array to disk
  ArrayStore #1, table()
  Close
  For_ ecx=0 To 255
        movlps xmm0, table(ecx) ; get a value from the table
        .if ecx<5 || ecx>=255-4
                .if Zero?
                                PrintLine "..."
                .endif
                Print Str$(ecx), Str$("\t%9f\n", f:xmm0)        ; print the first and last values
        .endif
        movd eax, xmm0
        stosd                   ; store the double's low DWORD
  Next
  pop edi
  loops=10000000
  push loops-1
  PrintCpu 0
  NanoTimer()
  .Repeat
        mov ecx, 256
        mov ebx, Rand(ecx)      ; test a random search position
        mov eax, DWORD PTR table(ebx)
        push edi
        push ecx
        repnz scasd             ; this is the search algo ;-)
        pop edx
        sub ecx, edx
        not ecx
        pop edi
        .if ecx!=ebx            ; check if the found position is the expected one
                Print Str$("\n## Mismatch at pos %i", ecx), Str$("<>%i ##\n\n", ebx)
                .Break
        .endif
        dec stack
  .Until Sign?
  Print Str$("%i random searches performed in ", loops), NanoTimer$()
EndOfCode


The binary search could be faster but also trickier. You would have to compare the HIGH dwords, and they are much more likely to have duplicates.
Title: Re: Fast Compare Real8 with SSE
Post by: guga on January 31, 2019, 06:34:41 PM
Tks a lot, Jochen and AW.

JJ, I´ll give more tests but it seems that your idea on using the low dword worked. I also came up with a routine biased on your´s and on AW, can you please benchmark this for me and see if it is working ?

Compare Double fpu (the low dword, in fact).
Code: [Select]

Proc FastCompare4_Double:
    Arguments @pTable, @Value, @Size
    Uses esi, ebx, ecx, edi

        mov edi D@Size
        xor eax eax
        shl edi 1 ; always mul by 2 at the start. A Real8 is double the size of a dword.
        mov esi D@pTable
L1:
        ; half = edi
        shr edi 1
        lea ecx D$edi+eax | mov ebx D$esi+ecx*4+4; Otherlow = low+half

        If D@Value = ebx
            mov eax ecx ; Commented out the inc eax . Inc return the actual Pos (from 1 to XX) and not the index from 0 to XX
            shr eax 1
            ExitP
        End_If
        cmovg eax ecx
       
        test edi edi | jne L1<
EndP

___________________________________

Example of usage:

[TestingTable1: D$ 0, 0
TestingTable2: D$ 0, 1
TestingTable3: D$ 0, 2
TestingTable4: D$ 0, 3
TestingTable5: D$ 0, 4
TestingTable6: D$ 0, 5
TestingTable7: D$ 0, 6
TestingTable8: D$ 0, 7
TestingTable9: D$ 0, 8
....
TestingTable255: D$ 0, 254
TestingTable256: D$ 0, 255]

call FastCompare4_Double TestingTable1, 177, 256

Compare Dwords only

Code: [Select]
Proc FastCompare4:
    Arguments @pTable, @Value, @Size
    Uses esi, ebx, ecx, edi


    mov esi D@pTable
    xor eax eax ; eax = low
     mov edi D@Size
L1:
        ; half = edi
        shr edi 1
        lea ecx D$edi+eax | mov ebx D$esi+ecx*4; Otherlow = low+half

        If D@Value = ebx
            mov eax ecx; Commented out the inc eax . Inc return the actual Pos (from 1 to XX) and not the index from 0 to XX
            ExitP
        End_If
        cmovg eax ecx

        test edi edi | jne L1<

EndP

_______________________________________

Example of usage:

[MyGugaTestTable1: D$ 0
MyGugaTestTable2: D$ 1
MyGugaTestTable3: D$ 2
MyGugaTestTable4: D$ 3
MyGugaTestTable5: D$ 4
MyGugaTestTable6: D$ 5
MyGugaTestTable7: D$ 6
MyGugaTestTable8: D$ 7
MyGugaTestTable9: D$ 8
(...)
MyGugaTestTable254: D$ 253
MyGugaTestTable255: D$ 254
MyGugaTestTable256: D$ 255]

call FastCompare4 MyGugaTestTable1, 177, 256


I hope that those functions are faster then the logarithm computation used on the original function :)
Title: Re: Fast Compare Real8 with SSE
Post by: jj2007 on January 31, 2019, 06:47:04 PM
can you please benchmark this

I'm afraid my assembler won't accept the syntax... do you have a translator?

Btw that table of REAL8s - is that always the same numbers, or are they randomly created? If they are the same, testing once if there are duplicates would be sufficient. If not, there is a hypothetical chance to find a duplicate, and that would require one extra run.
Title: Re: Fast Compare Real8 with SSE
Post by: guga on January 31, 2019, 11:19:59 PM
Hi JJ. Tks :)


I believe the syntax is like this

Code: [Select]

pTable = Pointer to the table of 256 Real8
Value = The value to be found. Any Integer
Size = Integer. Size of the table. Dummy, i know...since we are testing only 256 Real8, but i did like that to remember later of what i was doing :)

FastCompare4_Double proc pTable: Dword, Value:Dword, Size:Dword

                push    ebp
               mov     ebp, esp
                push    esi
                push    ebx
                push    ecx
                push    edi
                mov     edi, Size ; The size of a Real8 is always 2x the size of an integer. So, we need to double it´s size here so we can compare the value with only it´s low dword later on mov ebx [esi+ecx*4+4]
                xor     eax, eax
                shl     edi, 1
                mov     esi, pTable ; or.. offset pTable (I forgot the masm syntax. But here is a pointer not the value itself.

loc_45AEA1:
                shr     edi, 1
                lea     ecx, [eax+edi]
                mov     ebx, [esi+ecx*4+4]
                cmp     Value, ebx ; The valuye to be found is compared to the Low Dword from our Real8 table
                jnz     short loc_45AEB8
                mov     eax, ecx
                shr     eax, 1 ; Note: We are retrieving only the index and not the actual pos. For Real8 computations, divide the result by 2 to it retrieve the proper index. To retrieve the actual pos simply increase this result by 1
                jmp     loc_45AEC4 ; Since we found the value we can exit the function :)

loc_45AEB8:
                cmovg   eax, ecx
                test    edi, edi
                jnz     short loc_45AEA1
                mov     eax, 0FFFFFFFFh ; Nothing was found. Return -1

loc_45AEC4:
                pop     edi
                pop     ecx
                pop     ebx
                pop     esi
                mov     esp, ebp
                pop     ebp
                retn    0Ch
FastCompare4_Double endp

; ---------------------------------------------------------------------------



About the Real8 values...they are all the same. They are created from a computation of gamma and offset, whoose values are fixed for every color model, with the exception of only few models (sRGB HDTV and Adobe RGB 1998) that uses these, respectivelly: Gamma: 2.19921875, Offset: 0.055 and Gamma: 2.2, Offset: 0.099). All other models uses the same values for gamma and offset (2.2 for gammaa and 0.055 for offset. Other few models uses values as 1.8 for gamma).

The resultant values of Real8 (i named them as "Kfactor") on the table are given by the following equation:

ColorNormnalized = Color/255 ; The table of 256 Real8 is created using a range of integers from Color = 0 to 255. I mean, i created a function and insert integer values (not normalized) from 0 to 255 to it compute the Kfactor values for each one of the 256 colors.

KFactor = [ColorNormalized+Offset)/(1+Offset))^Gamma]  * 100 ; If ColorNormalized (Color/255) is bigger then Treshold
KFactor = (ColorNormalized/Slope)  * 100 ; If ColorNormalized (Color/255) is smaller or equal to Treshold

Threshold  and slope are calculated from Gamma and offset like this:

Threshold  = Offset/(Gamma-1)
Slope = [Offset * ( (Gamma-1)^(Gamma-1) )] * [(Offset+1)/(Offset*Gamma)]^Gamma

The only fixed values are, offset and gamma that the user must insert to create threshold and slope for the KFactor computation.
Title: Re: Fast Compare Real8 with SSE
Post by: guga on February 01, 2019, 01:02:41 AM
Note:

The values of gamma varies from 1.1 to 2.4
Offset is a tiny value. 0.055 (or 0.099 for HDTV only). The values seems to be x/1000 where x >0 < 100, but i´m not sure how they are achieved. I stablished the limits for gamma to avoid division by zero. But for offset i´ll take a look later. In cases where offset are 0 the correct would result in a slope of 1, if i assume that 0/0 = 1. But i don´t remember if for gamma and offset it do exists a formula to retrieve those values or they are only made by the user/manufacturer of a monitor for example.

The formulas can be found here:
https://en.wikipedia.org/wiki/SRGB
http://www.marcelpatek.com/color.html
http://hem.bredband.net/egostudios/gamma_management.htm
http://www.marcelpatek.com/gamma.html
https://web.archive.org/web/20130718042406/http://www.arcsynthesis.org/gltut/Illumination/Tut12%20Monitors%20and%20Gamma.html

Marcel Patek´s  formulas seems incorrect as mentioned earlier.
Title: Re: Fast Compare Real8 with SSE
Post by: jj2007 on February 01, 2019, 03:03:44 AM
Hi Guga,

I've tried my luck but your algo returns always -1 (and you haven't been generous with comments). This is supposed to be binary search, right? What's wrong? Does it work on your end?

The parameters are passed correctly:
Code: [Select]
mov edi, _Size ; 100h, ok
xor eax, eax
mov esi, pTable ; OK
; ^ ^ at this point: edi=100h, esi points to REAL8 values ^ ^

Code: [Select]
option prologue:none
option epilogue:none
if 0
  pTable = Pointer to the table of 256 Real8
  Value = The value to be found. Any Integer
  Size = Integer. Size of the table. Dummy, i know...since we are testing only 256 Real8, but i did like that to remember later of what i was doing :)
endif
;   invoke FastCompare4_Double, addr table(0), DWORD PTR table(ebx), ecx
FastCompare4_Double proc pTable: Dword, Value:Dword, _Size:Dword
push ebp
mov ebp, esp
push esi
push ebx
push ecx
push edi
mov edi, _Size ; 100h, ok
xor eax, eax
shl edi, 1
mov esi, pTable ; OK or.. offset pTable (I forgot the masm syntax. But here is a pointer not the value itself.
loc_45AEA1:
shr edi, 1
lea ecx, [eax+edi]
mov ebx, [esi+ecx*4+4]
cmp Value, ebx ; or.. offset Value (I forgot the masm syntax. But here is a pointer not the value itself.
jnz short loc_45AEB8
mov eax, ecx
shr eax, 1
jmp loc_45AEC4

loc_45AEB8:
cmovg eax, ecx
test edi, edi
jnz short loc_45AEA1
mov eax, 0FFFFFFFFh

loc_45AEC4:
pop edi
pop ecx
pop ebx
pop esi
mov esp, ebp
pop ebp
retn 0Ch
FastCompare4_Double endp
OPTION PROLOGUE:PrologueDef
OPTION PROLOGUE:PrologueDef
Title: Re: Fast Compare Real8 with SSE
Post by: guga on February 01, 2019, 04:35:55 AM
Oops..i forgot to reinsert the comments. Sorry  :redface: :redface: :redface: :redface:I removed them after optimizing. Here is the non optimized version with comment.

Here both are working, but only when it finds a match of the exact value. When it tries to find a value in between 2 it returns also -1

For instance, when the table is
[0, 5, 9, 12]
If the value to be found is 7, it will return -1.

I forgot to adapt the routine on the cases for the next value be bigger. Ex: 7 is in between 5 and 9. So it must return index 1

Here is the non optimized masm version. (For the exact match, i mean, and not the other to searching for values in between (i´m working on it, right now and see if i can be able to use the cmovge/cmovle as well to optimize:) )

Code: [Select]
FastCompare3_Double proc near
Probe           = dword ptr -1Ch
mid             = dword ptr -14h
Low             = dword ptr -10h
half            = dword ptr -8
CurSize         = dword ptr -4
pTable          = dword ptr  8
MyValue         = dword ptr  0Ch
Size            = dword ptr  10h

                push    ebp
                mov     ebp, esp
                sub     esp, 1Ch
                push    esi
                push    ecx
                push    edx
                mov     [ebp+Low], 0 ; Start LowVariable = 0
                mov     eax, [ebp+Size]
                shl     eax, 1
                mov     [ebp+CurSize], eax ; The Size always is mul by 2 because Real8 is 2x bigger then a Dword. So, current size starts with 512  (256*2)
                mov     esi, [ebp+pTable]  ; Pointer to our table of real8

loc_45AF7E:                             ; CODE XREF: FastCompare3_Double+6A↓j
                cmp     [ebp+CurSize], 0  ; On each loop the current size is being decreased. Size is used as a counter.
                jbe     loc_45AFCF
                mov     ecx, [ebp+CurSize]
                shr     ecx, 1
                mov     [ebp+half], ecx  ; half = cursize/2. half means searching for half of the table
                mov     ecx, [ebp+Low]
                add     ecx, [ebp+half]
                mov     [ebp+mid], ecx ; Mid is what will return. Here it is Mid = Low+CurSize/2
                mov     eax, [esi+ecx*4+4] ; esi will point to the middle of the table related to the half of the current position scnanned
                mov     [ebp+Probe], eax   ; The value from the table to check
                mov     eax, [ebp+MyValue] ; The inputed value
                cmp     eax, [ebp+Probe]
                jnz     short loc_45AFB4
                mov     eax, [ebp+mid] ; If MyValue is equal to the one being checked, we found the match.
                shr     eax, 1         ; since we are working with Real8, the original size was doubled. So we need to divide back by 2 to get the proper index. It will return the index. To return the pos, simply add 1 after the shr eax...
                jmp     loc_45AFCF     ; Exit our loop
; ---------------------------------------------------------------------------
;                jmp     short loc_45AFC1 ; Left over by RosAsm Else_If macro. Can be commented out
; ---------------------------------------------------------------------------

loc_45AFB4:                             ; CODE XREF: FastCompare3_Double+46↑j
                cmp     eax, [ebp+Probe] ; If the value from the table is bigger then My Value, it means we advanced the position. We lost he spot in forwards. Ex: MyValue = 5. Table ..... 2, 3, 5, 6 (6 = Probe). So Probe is forward the value to be found
                jbe     short loc_45AFBE
                mov     eax, [ebp+mid]    ; If we are forward it, get back to the previous position.
                jmp     short loc_45AFC1
; ---------------------------------------------------------------------------

loc_45AFBE:                             ; CODE XREF: FastCompare3_Double+57↑j
                mov     eax, [ebp+Low]  ; If we are here, it means that MyValue is located before the Probe. So, we are backwards. Ex:  MyValue = 5. Table ..... 2, 3 (Probe), 5, 6 . So Probe is before the value to be found

loc_45AFC1:                             ; CODE XREF: FastCompare3_Double+52↑j
                                        ; FastCompare3_Double+5C↑j
                mov     [ebp+Low], eax   ; Save the new position
                mov     eax, [ebp+half]  ; Current size is always the half of what is beeing scanned. Perform the look if we are not over (So, if Curize is bigger then 0)
                mov     [ebp+CurSize], eax
                jmp     loc_45AF7E
; ---------------------------------------------------------------------------

loc_45AFCF:                             ; CODE XREF: FastCompare3_Double+22↑j
                                        ; FastCompare3_Double+4D↑j
                pop     edx
                pop     ecx
                pop     esi
                mov     esp, ebp
                pop     ebp
                retn    0Ch
FastCompare3_Double endp

To use the function you can do this:

Code: [Select]
Myreal8Table    dd 3 dup(0), 1, 0, 2, 0, 3, 0, 4, 0, 5, 0, 6, 0, 7, 0
dd 8, 0, 9, 0, 0Ah, 0, 0Bh, 0, 0Ch, 0, 0Dh, 0, 0Eh, 0
dd 0Fh, 0, 10h, 0, 11h, 0, 12h, 0, 13h, 0, 14h, 0, 15h
dd 0, 16h, 0, 17h, 0, 18h, 0, 19h, 0, 1Ah, 0, 1Bh, 0, 1Ch
dd 0, 1Dh, 0, 1Eh, 0, 1Fh, 0, 20h, 0, 21h, 0, 22h, 0, 23h
dd 0, 24h, 0, 25h, 0, 26h, 0, 27h, 0, 28h, 0, 29h, 0, 2Ah
dd 0, 2Bh, 0, 2Ch, 0, 2Dh, 0, 2Eh, 0, 2Fh, 0, 30h, 0, 31h
dd 0, 32h, 0, 33h, 0, 34h, 0, 35h, 0, 36h, 0, 37h, 0, 38h
dd 0, 39h, 0, 3Ah, 0, 3Bh, 0, 3Ch, 0, 3Dh, 0, 3Eh, 0, 3Fh
dd 0, 40h, 0, 41h, 0, 42h, 0, 43h, 0, 44h, 0, 45h, 0, 46h
dd 0, 47h, 0, 48h, 0, 49h, 0, 4Ah, 0, 4Bh, 0, 4Ch, 0, 4Dh
dd 0, 4Eh, 0, 4Fh, 0, 50h, 0, 51h, 0, 52h, 0, 53h, 0, 54h
dd 0, 55h, 0, 56h, 0, 57h, 0, 58h, 0, 59h, 0, 5Ah, 0, 5Bh
dd 0, 5Ch, 0, 5Dh, 0, 5Eh, 0, 5Fh, 0, 60h, 0, 61h, 0, 62h
dd 0, 63h, 0, 64h, 0, 65h, 0, 66h, 0, 67h, 0, 68h, 0, 69h
dd 0, 6Ah, 0, 6Bh, 0, 6Ch, 0, 6Dh, 0, 6Eh, 0, 6Fh, 0, 70h
dd 0, 71h, 0, 72h, 0, 73h, 0, 74h, 0, 75h, 0, 76h, 0, 77h
dd 0, 78h, 0, 79h, 0, 7Ah, 0, 7Bh, 0, 7Ch, 0, 7Dh, 0, 7Eh
dd 0, 7Fh, 0, 80h, 0, 81h, 0, 82h, 0, 83h, 0, 84h, 0, 85h
dd 0, 86h, 0, 87h, 0, 88h, 0, 89h, 0, 8Ah, 0, 8Bh, 0, 8Ch
dd 0, 8Dh, 0, 8Eh, 0, 8Fh, 0, 90h, 0, 91h, 0, 92h, 0, 93h
dd 0, 94h, 0, 95h, 0, 96h, 0, 97h, 0, 98h, 0, 99h, 0, 9Ah
dd 0, 9Bh, 0, 9Ch, 0, 9Dh, 0, 9Eh, 0, 9Fh, 0, 0A0h, 0
dd 0A1h, 0, 0A2h, 0, 0A3h, 0, 0A4h, 0, 0A5h, 0, 0A6h, 0
dd 0A7h, 0, 0A8h, 0, 0A9h, 0, 0AAh, 0, 0ABh, 0, 0ACh, 0
dd 0ADh, 0, 0AEh, 0, 0AFh, 0, 0B0h, 0, 0B1h, 0, 0B2h, 0
dd 0B3h, 0, 0B4h, 0, 0B5h, 0, 0B6h, 0, 0B7h, 0, 0B8h, 0
dd 0B9h, 0, 0BAh, 0, 0BBh, 0, 0BCh, 0, 0BDh, 0, 0BEh, 0
dd 0BFh, 0, 0C0h, 0, 0C1h, 0, 0C2h, 0, 0C3h, 0, 0C4h, 0
dd 0C5h, 0, 0C6h, 0, 0C7h, 0, 0C8h, 0, 0C9h, 0, 0CAh, 0
dd 0CBh, 0, 0CCh, 0, 0CDh, 0, 0CEh, 0, 0CFh, 0, 0D0h, 0
dd 0D1h, 0, 0D2h, 0, 0D3h, 0, 0D4h, 0, 0D5h, 0, 0D6h, 0
dd 0D7h, 0, 0D8h, 0, 0D9h, 0, 0DAh, 0, 0DBh, 0, 0DCh, 0
dd 0DDh, 0, 0DEh, 0, 0DFh, 0, 0E0h, 0, 0E1h, 0, 0E2h, 0
dd 0E3h, 0, 0E4h, 0, 0E5h, 0, 0E6h, 0, 0E7h, 0, 0E8h, 0
dd 0E9h, 0, 0EAh, 0, 0EBh, 0, 0ECh, 0, 0EDh, 0, 0EEh, 0
dd 0EFh, 0, 0F0h, 0, 0F1h, 0, 0F2h, 0, 0F3h, 0, 0F4h, 0
dd 0F5h, 0, 0F6h, 0, 0F7h, 0, 0F8h, 0, 0F9h, 0, 0FAh, 0
dd 0FBh, 0, 0FCh, 0, 0FDh, 0, 0FEh, 0, 0FFh


_________________________________________________________________

push    256
push    253
push    offset Myreal8Table
call    FastCompare3_Double


Sorry for the assembly listing. I tried to port to masm, but i forgot the proper syntax, so i had to disassemble the function to you see.

In RosAsm the function looks like:

Code: [Select]
Proc FastCompare3_Double:
    Arguments @pTable, @Value, @Size
    Local @CurSize, @half, @Other_Half, @Low, @mid, @Probe
    Uses esi, ecx, edx

    mov D@Low 0
    mov eax D@Size | shl eax 1 | mov D@CurSize eax
    mov esi D@pTable
    ..While D@CurSize > 0

        mov ecx D@CurSize | shr ecx 1 | mov D@half ecx ; half = cursize/2
        mov ecx D@Low | add ecx D@half | mov D@Mid ecx | mov eax D$esi+ecx*4+4 | mov D@Probe eax

        mov eax D@Value
        If eax = D@Probe
            mov eax D@Mid; | inc eax . Inc return the actual Pos (from 1 to XX) and not the index from 0 to XX
            shr eax 1
            ExitP
        Else_If eax > D@Probe
            mov eax D@Mid
        Else
            mov eax D@low
        End_If
        mov D@Low eax

        mov eax D@half | mov D@CurSize eax

    ..End_While

EndP


The cmovge is a replacement to the routines in the Else_If macro.  On this example, we are computing the exact value. If the value on the table is > MYValue we use Mid as the checking value. Otherwise we use the previous "low". So we exchange the value according to the condition on the jmp (Greater then on this case).

I never used cmovge before. But it seems to work as expected. This algorithm i ported and adapted from:
https://academy.realm.io/posts/how-we-beat-cpp-stl-binary-search

I´ll try to fix to it handles  the cases of a value to be found in between the previous and the next one. Probably will need another cmovge or cmovle. I[´ll see if i can fix.

So far, the currrent version is working here for exact match only.
Title: Re: Fast Compare Real8 with SSE
Post by: jj2007 on February 01, 2019, 11:48:35 AM
Sorry, but I can't get it to work, Guga. In the meantime, I have designed a binary search algo, using the hiwords of the REAL8s. It is about twice as fast as repne scasd:
Code: [Select]
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz
1750000 random searches performed in 191 ms - repnz scasd
1750000 random searches performed in 97 ms
1750000 random searches performed in 95 ms
1750000 random searches performed in 95 ms
1750000 random searches performed in 95 ms
1750000 random searches performed in 95 ms
1750000 random searches performed in 95 ms
1750000 random searches performed in 96 ms
1750000 random searches performed in 96 ms
1750000 random searches performed in 95 ms
1750000 random searches performed in 93 ms
Title: Re: Fast Compare Real8 with SSE
Post by: guga on February 02, 2019, 12:21:45 AM
Tks a lot, JJ :t :t :t

Your version seems to work as expected including the values in between the limits :)

I tried to port the cmovge to masmbasic, but it seems to contains lots of errors while retrieving the values. A mismatch is being caused by a error of 1 or 2 missed values. So, when it was supposed to return 182 it returned 181 etc. Also your version seems to be faster. :t :t :t  But, if you want to try, this is the ported code. (Have lots of mismatches)

Code: [Select]
BinSearchP proc uses esi edi ebx edx ecx pTable, Val8:REAL8, numel
  mov edi, numel ; eax = size
  dec edi
  xor eax, eax
  shl edi, 1
  mov esi, pTable ; left

NotDone:

shr edi, 1
lea ecx, [edi+eax]
mov ebx, [esi+8*ecx+4]
cmp dword ptr Val8[4], ebx
cmovge eax, ecx

  test edi, edi
jne NotDone

cmp eax, dword ptr Val8[4]
jnae Exiting
     dec eax
Exiting:
  ret
BinSearchP endp

I gave a try and inserted your version on my CieLCH routine for colorize images and the result seems accurate. It do occurs some clippings on the Chroma yet, but it is most likely due to the limitation of the ratios between hue and chroma for a given Luminosity. I´ll see how to fix that later.

(https://i.imgur.com/Cl6wdLP.jpg)

The binary search routine you did works like a charm. Now, i´ll need to figure it out how to optimize a routine for sin, cosine and atan2 (arctang using atan2 function) computations. Optimizing those trigonometry functions willl make the convertion between RGBtoCieLCh/CieLChtoRGB works, at least, 2 to 4x faster then what it is actually. :icon_cool: :icon_cool: :icon_cool:

If you have somee examples of faster (and accurate at the same time) sin, cossine and atan2 fucntions, please post it so i can take a look and see if i can adapt to it those color convertion routines 8) 8) 8)
Title: Re: Fast Compare Real8 with SSE
Post by: jj2007 on February 02, 2019, 01:39:46 AM
Thanks, Guga. There seems to be a little bug in your algo, but right now I have no time to chase it. But I attach a version that has both algos implemented. You just need to change the string on top:

BinSearch$ equ BinSearchJ       ; use BinSearchG to test Guga's version

My version is not foolproof yet, either: search the source for seed, there is one that produces in pos 5/6 two very close numbers. Since the algo checks only the high DWORD of the REAL8, the comparison depends on a handful of bits of the mantissa only. That works most of the time, and it works always if the table is a known one, but for a random table it may create mismatches. It may not even be a problem, though. I would have to see concrete applications of such an algo. It is indeed fast, maybe that's the only thing that counts.
Title: Re: Fast Compare Real8 with SSE
Post by: guga on February 02, 2019, 03:52:37 AM
Tks JJ.

The bugs in mine version are probably because i forgot to include a routine to check for  numbers outside the range of the table as you did with:
Code: [Select]
  cmp ebx, [esi] ; <left?
  js @err
  cmp ebx, [edx] ; >right?
  jg @err

But, i was amazed it was also fast (despite the errors).

Quote
My version is not foolproof yet, either: search the source for seed, there is one that produces in pos 5/6 two very close numbers.
Maybe another final check for the Loword of the HIDWORD should make the trick to fix. It would work as the original verson but immediatelly after

Code: [Select]
  sub edx, pTable ; middle pos - original left
  sar edx, 3
  xchg eax, edx

So, it could compare the index value found at eax, with the LOWORD of the HIDWORD of the next value (or previous one). If it also matches, then we have found the proper values.


Quote
I would have to see concrete applications of such an algo. It is indeed fast, maybe that's the only thing that counts.

Well...i can think on a "few" pratical applications where this couldd be used :icon_mrgreen: :icon_mrgreen: :icon_mrgreen:  A faster binary search (for Real8 on this particular case and this situations of scanning on a ordered list) can be of extreme use for video, image and audio processing.  On the functions i´m creating for image processing, such algorithm is a must, speciaqlly because colorpaces functions makes heavy usage of complex mathematical equations that could be easily replaced by simple pointers to tables  and here iss where a faster binary search can be used.

For example, on my original version of CIELCHtoRGB, the part that actually converts the Value (In KFactor) to Red, Green or Blue colors are given by the following formula:

FinalColor = ((TempColor+0.055)/1.055)^2.4

The final color is computed also after checking for the threshold as i mentioned earlier


KFactor = [ColorNormalized+Offset)/(1+Offset))^Gamma]  * 100 ; If ColorNormalized (Color/255) is bigger then Treshold
KFactor = (ColorNormalized/Slope)  * 100 ; If ColorNormalized (Color/255) is smaller or equal to Treshold

On my original version i used the same way people often uses to calculate this stuff. So i inserted the formula on the final part of the color convertions routines such as:

Code: [Select]
Proc CieLCHtoRGB:
    Arguments @pLuminance, @pChroma, @pHue, @Red, @Green, @Blue, @Flag, @WhiteRef
    Structure @TempStorage 64, @pXDis 0, @pYDis 8, @pZDis 16, @TmpRedDis, 24, @TmpGreenDis 32, @TmpBlueDis 40, @pAFactorDis 48, @pBFactorDis 56
    Uses esi, ebx, ecx, edx, edi

    finit

(.....)

    lea ecx D@TmpRedDis | fld R@pXDis | fmul R$esi+FloatMatrices.M1Dis | fld R@pYDis | fmul R$esi+FloatMatrices.M2Dis | faddp ST1 ST0 | fld R@pZDis | fmul R$esi+FloatMatrices.M3Dis | faddp ST1 ST0 | fstp R$ecx
    lea ecx D@TmpGreenDis | fld R@pXDis | fmul R$esi+FloatMatrices.M4Dis | fld R@pYDis | fmul R$esi+FloatMatrices.M5Dis | faddp ST1 ST0 | fld R@pZDis | fmul R$esi+FloatMatrices.M6Dis | faddp ST1 ST0 | fstp R$ecx
    lea ecx D@TmpBlueDis | fld R@pXDis | fmul R$esi+FloatMatrices.M7Dis | fld R@pYDis | fmul R$esi+FloatMatrices.M8Dis | faddp ST1 ST0 | fld R@pZDis | fmul R$esi+FloatMatrices.M9Dis | faddp ST1 ST0 | fstp R$ecx

    call GammaLinearDecodingEx F_gamma, F_Offset, F_Slope, F_Treshold, F_OffsetPlusOne, D@Flag

    lea ecx D@TmpRedDis
    .Fpu_If R$ecx > R$F_Treshold
        ;Color = ((Color+0.055)/1.055)^2.4
        fld R$F_gamma | fld R$ecx | fyl2x | fld1 | fld ST1 | fprem | f2xm1 | faddp ST1 ST0 | fscale | fxch | fstp ST0
        fmul R$F_OffsetPlusOne | fsub R$F_Offset
    .Fpu_Else
        fld R$ecx | fmul R$F_Slope
    .Fpu_End_If
    mov ecx D@Red | fmul R$Float255 | fistp F$ecx

    lea ecx D@TmpGreenDis
    .Fpu_If R$ecx > R$F_Treshold
        ;Color = ((Color+0.055)/1.055)^2.4
        fld R$F_gamma | fld R$ecx | fyl2x | fld1 | fld ST1 | fprem | f2xm1 | faddp ST1 ST0 | fscale | fxch | fstp ST0
        fmul R$F_OffsetPlusOne | fsub R$F_Offset
    .Fpu_Else
        fld R$ecx | fmul R$F_Slope
    .Fpu_End_If
    mov ecx D@Green | fmul R$Float255 | fistp F$ecx

    lea ecx D@TmpBlueDis
    .Fpu_If R$ecx > R$F_Treshold
        ;Color = ((Color+0.055)/1.055)^2.4
        fld R$F_gamma | fld R$ecx | fyl2x | fld1 | fld ST1 | fprem | f2xm1 | faddp ST1 ST0 | fscale | fxch | fstp ST0
        fmul R$F_OffsetPlusOne | fsub R$F_Offset
    .Fpu_Else
        fld R$ecx | fmul R$F_Slope
    .Fpu_End_If
    mov ecx D@Blue | fmul R$Float255 | fistp F$ecx

    ; clip values outside the range
    mov ecx D@Red
    If D$ecx <s 0
        mov D$ecx 0
    Else_If D$ecx > 255
        mov D$ecx 255
    End_If

    mov ecx D@Green
    If D$ecx <s 0
        mov D$ecx 0
    Else_If D$ecx > 255
        mov D$ecx 255
    End_If

    mov ecx D@Blue
    If D$ecx <s 0
        mov D$ecx 0
    Else_If D$ecx > 255
        mov D$ecx 255
    End_If


EndP

See the problem ? To achieve the final color we need to do a bunch of mathematical operations using power functions to retrieve the proper value.

What i did was revert the formula and use them as a table that is pre-calculated way before CieLCHtoRGB is actually running. I found ouyt that the resultant values that generates the color have  fixed fractions (i named them as Kfactor). And the total amount of fractions are only 256 ! So, one fraction per color.

All i had to do is create a routine to calculate all those values and insert them on a huge structure, which i named as "WSMatrix" (So far, the size of this structure is 4388 bytes, but i´ll probably increase it because i need 2 or 3 more tables). I created a function called SetupWorkSpaceMatrixDataEx that can be used when the application starts, for example or under user choice only once.

The function SetupWorkSpaceMatrixDataEx pre-calculate literraly everything needed to convert from RGB to CieLCH and vice-versa, inserting all data (gamma, offset, white references, creating new matrices, calculating multiplicand fractions, stablishing limits etc etc) and put all of that on a single structure WSMatrix to be used internally.

The usage of a table and the pre-calculation of all those values is a must, specially because whenever i´m analysing a image/video etc, i no longer need to make all those computations for every pixel. All it is needed is for CieLCHtoRGB convertion is basically points to the precalculated data from WSMatrix and do simple math operations (if needed).

It´s a major advantage in terms of speed computation because we are no longer calculating everything for each pixel. For example, all those monster computationss i replaced with a simple :

Code: [Select]
Proc CieLCHtoRGB_Ex:
    Arguments @pLuminance, @pChroma, @pHue, @Red, @Green, @Blue, @pMatrix

    finit
    mov edx D@pChroma
    lea edi D@pAFactorDis | mov esi D@pHue | fld R$esi | fmul R$Degree_Radian | fcos | fmul R$edx | fstp R$edi
    lea edi D@pBFactorDis | mov esi D@pHue | fld R$esi | fmul R$Degree_Radian | fsin | fmul R$edx | fstp R$edi

(...)

    lea ecx D@TmpRedDis | fld R@pXDis | fmul R$esi+WS_Matrix.Inverted.Red_M1Dis | fld R@pYDis | fmul R$esi+WS_Matrix.Inverted.Green_M2Dis | faddp ST1 ST0 | fld R@pZDis | fmul R$esi+WS_Matrix.Inverted.Blue_M3Dis | faddp ST1 ST0 | fstp R$ecx
    lea ecx D@TmpGreenDis | fld R@pXDis | fmul R$esi+WS_Matrix.Inverted.Red_M4Dis | fld R@pYDis | fmul R$esi+WS_Matrix.Inverted.Green_M5Dis | faddp ST1 ST0 | fld R@pZDis | fmul R$esi+WS_Matrix.Inverted.Blue_M6Dis | faddp ST1 ST0 | fstp R$ecx
    lea ecx D@TmpBlueDis | fld R@pXDis | fmul R$esi+WS_Matrix.Inverted.Red_M7Dis | fld R@pYDis | fmul R$esi+WS_Matrix.Inverted.Green_M8Dis | faddp ST1 ST0 | fld R@pZDis | fmul R$esi+WS_Matrix.Inverted.Blue_M9Dis | faddp ST1 ST0 | fstp R$ecx

    lea ebx D$esi+WS_Matrix.KFactorMapDis

    mov edi D@Red
    fld R@TmpRedDis | fmul R$Float100 | fstp R@TmpColorCheckDis
    lea eax D@TmpColorCheckDis
    call BinarySearch ebx, D$eax+4, 256
    mov D$edi eax

    mov edi D@Green
    fld R@TmpGreenDis | fmul R$Float100 | fstp R@TmpColorCheckDis
    lea eax D@TmpColorCheckDis
    call BinarySearch ebx, D$eax+4, 256
    mov D$edi eax

    mov edi D@Blue
    fld R@TmpBlueDis | fmul R$Float100 | fstp R@TmpColorCheckDis
    lea eax D@TmpColorCheckDis
    call BinarySearch ebx, D$eax+4, 256
    mov D$edi eax

EndP


I can also gain a bit more speed using the BinarySearch function inline. But i choose to use as a call to a function to see first, if it was working  (and it is  :greenclp: :greenclp: :greenclp: :greenclp: )


Didn´t even need to benchmark, since the difference in speed when using binarysearch rather then calculating "Color = ((Color+0.055)/1.055)^2.4" all the time, is noticed on naked eyes :icon_mrgreen: :icon_mrgreen: :icon_mrgreen:


If you have time and can fix the binarysearch for foolproof it will be very handy.

Also, i´ll try to see a way to optimize the sin, cosinee and atan2 computations as well. Since both convertions RGB to CieLCH and it´s reversal) makes heavy usage of trigonometry operations like those, an optimizations is also needed to speed things even more.

Just to you have a small idea on how the optimization can make things better. When i started all of those convertion routines, the images i´m posting here to transfom Gray to Color took something around 40 minutes to finish processing. After using the tables technique, the convertion was done in 4 to 8 seconds (Including the time to pre-calculate all of this). Now...with your´s optimization and the fixes i made on the convertion routines, it is taking something around 2 to 4 seconds including the precalculation routines to colorize a 640*480 grayscale image using a 960*720 reference color image.


This is to say that with further optimizations, i can reduce this total time of less then 1 second including the precalculations. So, since the precalculations are made when the app starts for example, it means that the total amount of time of coloring a image can be done in a few miliseconds.

Also, once i suceed to make the colorization method output a file formed by a kind of sample structure, then all of this can be done in almost notime. All is needed is retrieve the pointers from an external file, make the necessary adjustements to those pointers, and voialáa, you can convert a grayscale image onto a colored one almost immediatelly.

If i can be able to do this with image, the same process can be applied to video processing. For a video with let´s say 50.000 frames, if to colorize each frame it takes, let´s say 100 miliseconds per frame, it means that the video will be totally colorized in something around 15 minutes rather then years since the original version that took around 45 minutes to colorize 1 single image :greensml: :greensml: :greensml: :greensml:
Title: Re: Fast Compare Real8 with SSE
Post by: jj2007 on February 02, 2019, 04:26:11 AM
The bugs in mine version are probably because i forgot to include a routine to check for  numbers outside the range of the table
No, that's not a problem since we only feed valid numbers. The check can be omitted.

Quote
So, it could compare the index value found at eax, with the LOWORD of the HIDWORD of the next value (or previous one). If it also matches, then we have found the proper values.
Yes, something like that. But in most cases it's not necessary, since the numbers will be "distant" enough anyway. We are checking the high DWORD, i.e. the one that contains the exponent and part of the mantissa:
(https://upload.wikimedia.org/wikipedia/commons/thumb/a/a9/IEEE_754_Double_Floating_Point_Format.svg/618px-IEEE_754_Double_Floating_Point_Format.svg.png)
That "part" is 32-11-1=20 bits, or about 6-7 digits. Should be sufficient in most cases ;-)
Title: Re: Fast Compare Real8 with SSE
Post by: daydreamer on February 03, 2019, 02:54:13 AM
good work
I think there is a fast chebyshev on the board and maybe way of unroll to search r,g,b at once?
Title: Re: Fast Compare Real8 with SSE
Post by: guga on February 03, 2019, 04:16:32 AM
Tks Daydreamer. :t :t

I´ll take a look at the chebyshev on the board and see if the adaptation to the functions works ok.

About a way to unroll to search for rgb at once...well..it could be done but there would lead to some problems  and one of them is a conceptual one.

1s) Creating such a method would requires at least one huge table. Each table should have something around 398 Mb. Since the range of RGB is 0FFFFFF we will searching for 16581375 pixels. So, 16581375*8 (size of each Real8) * 3 (3 elements, one for Luminance, other for Chroma and other for hue) we will end up on a table with almost 398 Mb to search within. If we need 2 more tables it will something around 800 Mb and so on.

2nd) The conceptual problem is, actually, the harder one. All those formulas uses fractions to work with and exponential/trigonometry/power math functions to compute the values of each element (Luminance, hue, chromas, etc). The main problem is that,from all those 16 millions colors a huge amount of them are indistinguishable between each other. For example, RGB = 199, 255, 251 produces practically the same result (in chroma/hue/luminance) as 198, 254, 250 or 197, 255, 251 or 196, 255, 251. The human eye can´t identify which color is which. So, on a conceptual point of view all of those values represents the exact same color. So it would be hard to choose which color combination to use to convert back (From CieLCh/CieLab/XYZ to RGB) or simply by making the pointer to a table which will then points to another location.

What i found out on those perpcetual colorspaces is that they all claims that Luminance is completelly isolated from Chroma or Hue but, in fact, on their formulas Luma influences chroma and hue (although on a minor percentage). I also found a meaning of what is actually a gray color. Analysing the behaviour of luminance i found out that what we call gray is nothing more then a range of luminance spread in 256 pieces.

So, what will determine which gray color a RGB combination will have is basically the luminance.

So, for example, no matter what are the RGB combination used or what is the chroma or hue etc, whenever the luminance have a range in between
0 to a value smaller then 2.741066938704112e-1 it always will return in 0 as gray.

Therefore since gray represents the amount of luminance of a pixel, it also means that colors (on their isolated channels) are nothing more then the variation of luminance and thus, gray range. Thinking on that i came up with a table to be used to convert from luma to gray (according to the chosen model: Adfobe RGB 1998, sRGB, HDTV etc etc). It is something like this:

    Gray  Luminosity (Min)
    0       0
    1       2.741066938704112e-1
    2       5.48213387740825841e-1
    3       8.22320081611237263e-1
    4       1.0964267754816519
    5       1.37053346935206344
    6       1.64464016322247808
    7       1.91874685709288939
(...)
    253     99.309586872082832
    254     99.6549222327689391
    255     100

    So, the range of gray can be interpreted as:
   
    Color   Luminosity range
    0       0 to < 2.741066938704112e-1
    1       >= 2.741066938704112e-1 and < 5.48213387740825841e-1
    2       >= 5.48213387740825841e-1 and < 8.22320081611237263e-1
    ...
    254     >= 99.6549222327689391 and < 100
    255     = 100


This lead me to another problem using the "common" formulas for RGBtoCieLab/RGBtoLCH. Assuming that we are working with tiny ranges of gray, i tried to make the RGBtoCieLab/RGBtoLCH function fixes the range to it´s minimum. So, for example, if a color combination results on a luminance of 6.7e-1, it means that this color luminance is related to the color 2 (Gray = 2), thus i´ll need to decrease the luminance value to it fit´s the minimum of 5.48213387740825841e-1.

But, then a problem came up. When i do this, the resultant values are converted accordly, so it reduces the chroma, but....since the formula also relates Y (From XYZ) to luminance and Y is directly related to the R,G,B pixels, if i reduced the value of luminance i´ll need at end to proportionally reduce the value of R,G,B individually.

The main problem is that when i do the convertion backwards, it will also reduce the value to a new RGB range. For example, inputing RGB = 199, 255, 251, it is in fact this combination: RGB = 198, 254, 250 (if the reduction is proportional) or RGB = 196, 255, 251 (if the reduction is biased on Chroma only).

But...when i convert back those reducted values 198, 254, 250 or 196, 255, 251 they will keep the proportion and reduce again decreasing each pixel value by 1.

This is because the current formulas used to compute the perceptual colorspaces works with Fractions to compute pixels whose values are integer (There´s no such a thing as a red = 154.56564...The value is simply 154 (or 155 according to  the used rounding mode).

To fix that i´ll have to find the limits in between Chroma, Hue and luminance, regardless their claims (false ones) that they are completely separated things.

The correct equation to compute all of that crap is this something like this.

Z = (Chroma*sin(hue)/200) + (Luma+16)/116

Where Z is the Z element of the XYZ colorspace.

But...Z is directly related to R,G,B pixels after multiplying by a matrix, in the form of: Red*Matrix7+Green*Matrix8+Blue*Matrix9

If i reduce proportionally, i´ll necessarily assume that Red, Green and Blue must be proportionally reduced as well....Or one of them will be readjusted to fits to the Z value. In any case, the resultant chroma and hue will be altered on a way that the backwards computation will be impossible because one of the channels will always be reduced on each time we use the function, due to the problem of using integers to compute fraction-ed elements and trying to do the backwards computation where there no such a thing as a fraction-ed red, green or blue.

Not sure if i´m being clear about all of this. The bottom line is that those convertion algorithms suchs  :icon_mrgreen: :icon_mrgreen: :icon_mrgreen: :icon_mrgreen:

I´m pretty sure that i´ll need to find some limits to work on those things and end to build 2 or 3 more tables. The problem is find them for each color range. :redface: :redface:
Title: Re: Fast Compare Real8 with SSE
Post by: daydreamer on February 03, 2019, 05:06:55 AM
sounds very complicated

question,if I want to tween between two pixels with colorspace HSV or CieLCH ,or any colorspace with Hue,maybe I need some IF to check direction,so I dont end up with hue pixel1=5 degrees,hue pixel2=350 degrees,tweens between 5 and 350 degrees,when it should tween for example 5degrees,2.5,0,357.5,355,352.5,350 instead?
Title: Re: Fast Compare Real8 with SSE
Post by: guga on February 03, 2019, 06:54:57 AM
I´m not sure if i understood. Tween means changing hues to avoid large difference ? Don´t know the meaning of "tween" in english. Google returned me it means 'tweak', but..still don´t understood :icon_mrgreen:. If it is, the HSV, CieLCH colorspaces already have some If conditions to deal with the Hue direction problem. The direction is checked using thresholds. To fix problems of hue directions (so, the opposed signs) is necessary to compute the "a" and "b" factor from CieLab colorspace (Or as i did directly from CieLCH).

For example, on CieLCh colorspace, Hue is determined by the following equation:

Hue  = Atan (b/a), where b is the "Bfactor" from CieLab and "a" the "Afactor" from CieLab

"a" and "b" are achieved from the X,Y, Z values from the XYZ colorspace using these formulas:

a = (X-Y)*500
b = (Y-Z)*200

whereas, Luma = Y*116-16

Also, the value of "a" and "b' can be found using the CieLCh colorspace with the following formulas:
a = cos(hue)*chroma
b = sin(hue)*chroma


The correct way to find "a" depends on the colorspce you are working with. I made a variation of CieLCH that uses XYZ values directly rather then having to migrate in between each colorspace to find the very same value.

The final values of X, Y and Z are calculated from a multiplication of Red, Green and Blue with the proper matrices for AdobeRGB, sRGb etc. and after a IF condition to check some thresholds.

The If condition is in the form of:

If  X > (216/24389)
    X = X^1/3
Else
     X = X*(841/108) + (16/116)
Endif


The same conditions are used for Y and Z


The complete pseudo-code to transform RGB to CieLCh is this:

Red = Red/255 ; The integer colors must be normalized 1st. So, divide each color (Integer) by 255.
Green = Red/255
Blue = Red/255

TmpRed = Red*Kfactor
TmpGreen = Green*Kfactor
TmpBlue = Blue*Kfactor

where "Kfactor" is computed using the gamma and offset adjustments . (If needed i post the formula to achieve "Kfactor" for you).

X = (TmpRed*Matrix.RedM1 + TmpGreen*Matrix.GreenM2 + TmpBlue+Matrix.BlueM3) / WhiteReferenceX
Y = (TmpRed*Matrix.RedM4 + TmpGreen*Matrix.GreenM5 + TmpBlue+Matrix.BlueM6) / WhiteReferenceY
Z = (TmpRed*Matrix.RedM7 + TmpGreen*Matrix.GreenM8 + TmpBlue+Matrix.BlueM9) / WhiteReferenceZ


where:
Matrix.RedM1, Matrix.RedM2, Matrix.RedM3 ... are the tristimulus values for XYZ colorspace used in Adobe1998, sRGB etc. Example, for sRGB, the values are:

Matrix.RedM1     Matrix.GreenM2       Matrix.BlueM3
0.4124564          0.3575761              0.1804375
Matrix.RedM4     Matrix.GreenM5       Matrix.BlueM6
0.2126729           0.7151522              0.0721749
Matrix.RedM7     Matrix.GreenM8       Matrix.BlueM9
0.0193339           0.1191920              0.9503041

WhiteReferenceX, WhiteReferenceY, WhiteReference| are the white references for observers in 2º or 10º degree described in CIE convention. For example:

For D65, 2º observer (Daylight, sRGB, Adobe-RGB)

WhiteReferenceX = 95.047 . Btw: WhiteReference is nothing more then the simple sum of the matrice values multiplied by 100. Ex: (0.4124564+0.3575761+0.1804375)*100
WhiteReferenceY = 100
WhiteReferenceZ = 108.883




If  X > (216/24389)
    X = X^1/3
Else
     X = X*(841/108) + (16/116)
Endif

If  Y > (216/24389)
    Y = Y^1/3
Else
     Y = Y*(841/108) + (16/116)
Endif

If  Z > (216/24389)
    Z = Z^1/3
Else
     Z = Z*(841/108) + (16/116)
Endif

; The above X,Y,Z are the same ones found in XYZ colorspace. I then convert them to lab and after it to hue/chroma (In fact, i make it directly on the same function) , but, below are the necessary equation used, in case you want only the CieLab convertion

; retrieve a, b and Luminance

Luma = Y*116-16
a = (X-Y)*500
b = (Y-Z)*200

and finally, achieve Hue and chroma from "a" and "b"

Chroma = sqrt(a^2+b^2)
Hue = atan (b/a)


This is the pseudo-code to convert from RGB to CieLCH. Here i´m using gamma adjustment (The "Kfactor" i created to make those equations easier and the code faster to process) and white reference to achieve a higher level of accuracy and making the image looks better according to a chosen colorspace (Adobe RGB 1998, sRGB, HDTV etc)


The whole convertion itself is not that complicated after i simplified the equations. The main problem relies on the way those conversion works. CieLab and other perceptual colorspaces have something called color discontinuity that is nothing more nothing less then a failure on the original equations (I call it simply as: bulls* :greensml: :greensml: :greensml:)

bruce lindbloom make an effort to fix the error here: http://www.brucelindbloom.com/LContinuity.html

In fact, on a mathematical point of view, he was correct, but conceptually, it still don´t fixes the problem with those colorspaces.

For example, using Y*116-16 will necessarily lead to an error on the way color is perceived, since we are generating a "gap" of 16 values on the equations (I have no idea how to fix that properly. So...i´m using Bruce´s suggestion)

One way to fix those colorspaces is create a new one as i did last year. The colorspace i designed is spherical, using all 3 dimensions as the vertices for Red, Green and Blue without any usage of those "gaps". The usage of a spherical colorspace seems more logical and accurate since "color' is nothing more then the form that our eyes (brain, in fact), perceive an eletromagnetic wave after passing through our eyes and cells responsible for color perception. Eletromagnetic waves travels in space in all directions, at the same speed.  Considering a tridimensional space, then we can then represent each x,y,z axis as Red, Green and Blue and work with them proportionally (Spherically), instead using those crazy equations actually used. It is a variation of a TSL colorspace as i mentioned earlier. I succeeded to create a formula to convert back, but still have problems, because i have to find the limits the resultant values and after a while i gave up trying. maybe eventually i´ll restart that work.
Title: Re: Fast Compare Real8 with SSE
Post by: daydreamer on February 03, 2019, 06:36:22 PM
sorry I try to practice english everyday,but sometimes it sneaks in wrong words and I am not native english speaker
think of you open up a dialog of image resizing in a image program,photoshop etc,you get a dialog of different interpolationalternatives,linear etc
but think interpolation while in HSV,HSL or CieLCH color space image,Hue channel
Title: Re: Fast Compare Real8 with SSE
Post by: Siekmanski on February 03, 2019, 10:15:16 PM
Hi guga,

Interesting topic.

As far as I understand you are calculating a new colored image by using the luminance
of a source image and the chroma colors of a reference image?

So the calculated final image has max. 256 different colors?
Title: Re: Fast Compare Real8 with SSE
Post by: guga on February 04, 2019, 01:35:38 AM
Hi Daydreamer. I asked because im also not a native english speaker   :bgrin: :bgrin: :bgrin:

I understood now. But when you open a dialog and change the Hue angle, the equations used are always those ones i posted. The problem is that hue is directly related to chroma and not with luminosity. So, when changing hue you end up also changing chroma and vice-versa. The main problem i´m facing is find the relation between chroma and hue on such a way that when you are doing the reverted operation (CieLab to RGB) there will no more need to clip anything.

Hi Siekmanski :) Yes...for this preliminary tests i´m transferring only a maximum of 256 colors. Once i suceed to fix those damn problems on the CieLCH equations, i can then try to make the new image contains 65536 and later 16 millions colors. But, i choose to try the very basics 1st to see if it works. But, this technique i´ll do later, once i finish to find a way to fix the CieLCH equations.

So far it works ok on a maxmimum of 256 colors. It get´s the luminance of all pixels on the colored image and searches for the ones that best macthes with the luminance of the gray image and do the transfer. To make it extend the amount of colors, i´ll need to search for the neighbor pixels and see what chunck of pixels on the gray image matches with the same chunck of pixels on the colored one. (Transferring the texture, then.)

The resultant image, so far is Ok, i mean, I was able to transfer the exact same luminance of both images and transferring the chromacity of the reference to the target (gray). If i then convert the new colored image to grayscale again, it will have the same gray pixels as before the convertion . This shows me that the transfer was ok:)

However, it still is ocurring some clippings on the colored image due to this damn lack of limits on the CieLab/CieLCH equations.  I´m struggling to find something as simple as a fraction to stablish the limits. SOmething like: x (From luma1) = Chroma/Hue ; y (From Luma2_ = Chroma (From luma2)/Hue (Also from luma2) etc etc etc.

If the equations were linear it would be eaiser but... :icon_eek: :icon_eek: :icon_eek:

It´s important to fix the CieLCH issues because on the reference image (colored) one, there´s no need to search for thousands of repeated colors or the ones that are perceptual indistinguishable from each other. I can create a sort of sample file from the reference image more accurated then. This is to say, if i have a colored image of 960*720, i don´t need thousands of colors combinations. I´ll need only a few bunch of them. I can create a file containing, something around 65000 thousand different (unique) samples to be chosen for any transfer technique needed. So, no matter if i extend the final limit to 65536 or 16 million colored images (Unnecessary, btw, since from all 16 million colors, only a few thousands of them are perceptual different from each other), it will always look for the same 65536 samples of the reference one to create the new color image. It will only depends of the position of the pixels etc. Also, if i later start using the neighbour pixels technique, it is possible to create a new color pixel biased on the reference (I mean, a new color that don´t exists on the original reference image). For example 2 pixels in between one have the same (or similar) hue, but different chromacity. Then, all is needed is take the average chromacity and hue among them to create the new pixel existent in between.

Sure, i can also create a file (or several files) containing 16 million colors samples, but so far i´m doing slow to see how it works.
Title: Re: Fast Compare Real8 with SSE
Post by: Siekmanski on February 04, 2019, 02:14:18 AM
Can you post the original source and reference images?
So I can play with the algorithms.....
Title: Re: Fast Compare Real8 with SSE
Post by: guga on February 04, 2019, 02:16:54 AM
Sure  :t :t :t

let me only clean up the code before posting. I´m doing it right now.
Title: Re: Fast Compare Real8 with SSE
Post by: guga on February 04, 2019, 10:50:04 AM
Hi SiekManski

Here is the cleaned file (I had to upload in my google drive because it´s big for upload here).  I created a small app to be tested. You only need to load an image and incrase/decrease the Scroolbar btn to change the luminance up or down. This is a dumb test for the RGBtoCieLCH and CieLCHtoRGB convertions.

The main routines responsible for the conversion are the functions:

    RGBtoCieLCH_Ex
    CieLCHtoRGB_Ex

They are located inside a function i named as "NewAdjustLuma" This functon simply adjust the luminance. So it increases or reduces the luminance from those conversion functions.

Also, for the data that is preloaded when the app starts, the function is: SetupWorkSpaceMatrixDataEx located in the WM_CREATE on the main window procedure.

SetupWorkSpaceMatrixDataEx is responsible to create all necessary data to be used on the conversion functions. It generates the gamma values, white references , compute the adaptation method (I´m using the default mode. I created a serie of equates for the several different color models, but, the better is use CS_MATRIX_SRGB_D65_HDTV as it is. On this way we can test on the same method.

WS_Matrix is a huge structure where all necessary data is calculated and preloaded once the application starts. is upon some members of the WS_Matrix that the functions "RGBtoCieLCH_Ex" and "CieLCHtoRGB_Ex" grb the necessary ata, simply pointing to the proper addresses rather then haveing to calculate all over again for every pixel.


Btw...The source code is for RosAsm, so it´s embedded in the app. I uploaded a new version of RosAsm i uses to my personal tests here: https://www.tapatalk.com/groups/rosasm/rosasm-download-t2.html  I tried to upload here, but the file is big for the attachement.  No need to make deep configurations, just unzip rosasm on any directory and once the configuration dialog shows up, simply points the asked files on the proper path. (Equates.equ, clip.txt etc...) Those files are located on the RosAsmFiles subdirectory. So you need only to point there when(or if) asked.

And here is the link to the test file for the CieLCH convertion. (If you need i rip the asm source alone and also i can try making a pdf explaining the funcionality of the functions, or the convertion algorithms.)

https://drive.google.com/open?id=1gztsNpNw4_61oHYrURpc1b81tyW_1hKZ


About the images i use to test, i suceeded to attach here

Note: Those routines and test app is not for the grayscale to color convertion. What i sent was the CieLCH convertiosn to be fixed 1st. The other part of the code necessary for the grayscale to color is kinda huge and i´ll need to clean that too, because it´s a total mess the way it is already.

Can you help trying to fix the CieLCH convertions, 1st  and later when we succeeded we give a try on the colorization method ? Maybe it be better understand the whole code, i guess.
Title: Re: Fast Compare Real8 with SSE
Post by: guga on February 04, 2019, 11:36:58 AM
And here´s the source for the CieLCHtoRGB and RGBtoCieLCh functions.

Btw....the "convert" button on the test file i sent, do...nothing :greensml: :greensml: To make easier i removed the mess that was related to it. It was the one responsible for converting from gray to color, but since on this test i removed that code to we test 1st the colorspace convertions, i forgot to remove that button as well :redface: :redface: :redface:
Title: Re: Fast Compare Real8 with SSE
Post by: HSE on February 04, 2019, 01:19:14 PM
Btw....the "convert" button on the test file i sent, do...nothing :greensml: :greensml:
Very, very funny. I was thinking that was so slow  :biggrin:
Title: Re: Fast Compare Real8 with SSE
Post by: guga on February 04, 2019, 02:51:59 PM
 :redface: :redface: :redface: :icon_mrgreen: :icon_mrgreen: :icon_mrgreen:

Once the fixes of the colorspace conversions were sucessifull i´ll post another version without that btn. And later another one containing the full (and cleaned) code to the gray to color transfer. :t :t
Title: Re: Fast Compare Real8 with SSE
Post by: Siekmanski on February 04, 2019, 04:03:20 PM
Thanks, where can I get the image of the woman with the colored lips?
Title: Re: Fast Compare Real8 with SSE
Post by: guga on February 04, 2019, 04:21:06 PM
Attached here

Title: Re: Fast Compare Real8 with SSE
Post by: jj2007 on February 05, 2019, 10:59:58 PM
Hi Guga,

See http://masm32.com/board/index.php?topic=94.msg83812#msg83812 for an update on the "find the number" problem.
Title: Re: Fast Compare Real8 with SSE
Post by: guga on February 05, 2019, 11:56:42 PM
Tks a lot, JJ :)  :t :t :t

I`m creating a pdf about the CieLCH routines describing the algorithms and trying to figure it out some limits of it to be used on the backwards convertion from CieLCH to RGB. As soon i finish, i´ll post it here to see if you guys can help me finding the correct limits for this. The goal is make the backwards calculation more accurate  avoiding clipping the final result. And, of course, once it´s fixed we can try optimize it further. I read your´s algo on fastsin to later use on the CieLCH routines, and it´s amazing :)

The next  optimizations will be in sin, cos, atan and power (exponencial, logaritm etc) functions :)
Title: Re: Fast Compare Real8 with SSE
Post by: guga on February 07, 2019, 10:12:37 AM
Hi Guys

Here is a pdf with the functionality of the RGB to CieLCh colorspace. Can someone help me seeing if the math operations (and it´s properties) are ok ?

I tried to build the properties of the math to make easier to the conversion backwards works ok.

I found out that

Chroma, Luminance and Hue have direct relations/limits/dependencies among them.

The final property of all of that seems to be:

Luma > 116* [ (5*Afactor - 2*BFactor)/1000 ] -16 ; where "BFactor" and "Afactor" are the "a" and "b" values for the CieLab colorspace

For Chroma, i found that this relation/limit is:

Chroma  <  [(1000/116)*(Luma+16)] / [5*cos(Hue)-2*sin(Hue)]

Knowing these properties are important to build the backward computation (CieLCHtoRGB), 1st check for it´s limits before trying to convert back.

Title: Re: Fast Compare Real8 with SSE
Post by: daydreamer on February 07, 2019, 12:25:39 PM
Tks a lot, JJ :)  :t :t :t

I`m creating a pdf about the CieLCH routines describing the algorithms and trying to figure it out some limits of it to be used on the backwards convertion from CieLCH to RGB. As soon i finish, i´ll post it here to see if you guys can help me finding the correct limits for this. The goal is make the backwards calculation more accurate  avoiding clipping the final result. And, of course, once it´s fixed we can try optimize it further. I read your´s algo on fastsin to later use on the CieLCH routines, and it´s amazing :)

The next  optimizations will be in sin, cos, atan and power (exponencial, logaritm etc) functions :)
its not good to go from real8 precision color channels to RGB being integer byte size,I think it should be better with floats and maybe final antialiasing filtering before final conversion to bytesized RGB channels
maybe try 2tiles,which user can choose bigger or smaller tilesize to match their cpu's cache better,2tiles=two threads/two cores,seen most timing reports here show most of us have modern cpu's,so why not try that solution?

I also have question about minimum reference pic size to make it work?are there any rule you follow when you have 4times bigger than greyscale?

read PDF,ouch too many divisions there
slope and threshold is depending on slope,would it be possible to change slope calculation to get inverse slope instead,so you can change to threshold=offset/(gamma-1)*inverseslope ?
restricting max/min of gamma can be made using MAXSD and MINSD
Title: Re: Fast Compare Real8 with SSE
Post by: guga on February 07, 2019, 02:40:08 PM
Hi daydreamer :)

Quote
its not good to go from real8 precision color channels to RGB being integer byte size,I think it should be better with floats and maybe final antialiasing filtering before final conversion to bytesized RGB channels
maybe try 2 tiles,which user can choose bigger or smaller tilesize to match their cpu's cache better,2tiles=two threads/two cores,seen most timing reports here show most of us have modern cpu's,so why not try that solution?

I prefer first to try to understand how Luminance, Chroma and Hue are related to each other. It is possible to use integers (From 0 to 255), instead Floating Points (Real8 or Real4), but, in order to do that, those "monster" equations needs to be fixed. There´s no need for an anti-aliasing filter on the Conversion functions. It would slow down the final conversion a lot and there´s no guarantee that the final result would be accurate. ABout user´s choice on titlesize or using more then 1 thread, i overcome that using tables for luminance (Kfactor and Lumamap) Kfactor are only 256 different and unique values regarding the gamma/offset, already pre-calculated before the main function starts. This speeds ups the computation a lot, since we don´t need anymore to perform all gamma calculation whenever we use the RGB to CieLCH function. The same thing for a lumaMap i created that is a simple table of 256 ranges of luminance each one of them related to a gray color as well. Both tables works on the 2 types of operation, i mean, works for RGB to CieLCH and the backward computation (CieLCh to RGB).

Quote
I also have question about minimum reference pic size to make it work?are there any rule you follow when you have 4times bigger than greyscale?

Well...i didn´t established any limit for the size of a image yet. At least not on the RGB to CieLCH and CieLCH to RGB functions. However, i was forced to create some limit on the algorithm i´m making for convert gray to color, because if the reference image is big (Let´s say, 2900x1920 etc ) it will generate a buffer in memory with something around 2 GB to create the samples, and i was forced to limit that because on my PC, windows refused to work on files (or memory buffer) with 2GB :greensml: :greensml: :greensml:
Of course, this was only a preliminary test, i don´t plan to create a monster file sample to be used. The plan is create several different tiny pieces of samples to be used. I´m pretty sure that, even a small sized colored image, such as 64x64 can provide enough samples to convert from gray to color, but i did not reached that part of the development yet.

The importance of the RGB to CieLab and his reversal algorithms are exactly to avoid the usage of unnecessary color combinations. From what i´m concluding so far, there´s simply no such a thing as 16 million colors. We may have millions of "colors" that are the exact same thing as another combination of values and all we need to do is a way to find them to be removed from the final computation. Pixel values are one thing, but colors (In terms of hue/chroma/luma inter-dependencies) are another. Ex: RGB (100,99,40) is the same color as RGB(100,99,48) despite the fact that it differs in 1 byte/pixel. And we have thousands (if not millions) of those unnecessary color combinations to represent the very same color.

I was surprised to see that luminance was responsible for the color that a gray image should have and also even more when figure it out that gray is nothing more then a small range of luminance.

So, i´m trying to figure it out if such ranges also exists for chroma and hue. I mean, if we have a luminance range from 88 to 89 (Let´s say, a gray color of 136), i wonder if that range also have specific ranges of chroma and hue to work with.If i did the math correctly (and it is a Big IF  :bgrin: :bgrin: :bgrin:) then this limits may exists if the following equation is accurate:

Chroma  <  [(1000/116)*(Luma+16)] / [5*cos(Hue)-2*sin(Hue)]

The only problem is see if there is a limit for chroma (or even hue) for each luminance range as well. (I hope there is :) ) If it do, all it is needed is one or more table containing those ranges (min and max) and the backwards conversion will be even more accurate (and perhaps faster).

Quote
slope and threshold is depending on slope,would it be possible to change slope calculation to get inverse slope instead,so you can change to threshold=offset/(gamma-1)*inverseslope ?
restricting max/min of gamma can be made using MAXSD and MINSD

I already did the backwards  computation. I just didn´t created the pdf containing the reversal (backwards) algorithm yet, because those limits needs to be found 1st. If there are such limits/boundaries, then the backwards computation will be more accurate and without clipping.

If necessary, i can try making the backwards pdf right now, but, perhaps, solving that small issues on the algorithm would be better, i think.

The backwards maths is quite trivial, btw. It´s basically the multiplication of the inverted matrices, after converting back from Chroma/Hue/Luma to XYZ colorspace. Once the multiplication is applied it retrieves back the same values of the "Kfactor" (gamma, etc etc), and all is needed is search for them on their own tables, rather then making more math to retrieve that exact same color value.

Title: Re: Fast Compare Real8 with SSE
Post by: Siekmanski on February 08, 2019, 03:11:58 AM
First of all, I'm not an expert in this topic and maybe I'm totally missing the point here.  :icon_eek:

Just my thoughts over this,

A = Source Bitmap
B = Reference Bitmap
D = Destination Bitmap
L = Luminance = 256 grey values ( humans are most sensitive to these )
C = Chroma = 256 * 256 = 65536 color values

As far as I understand we are using the L of A and the C from B to construct D.

Then we need to construct a Chroma table from B and calculate the missing Chroma colors to complete the total of 256 Chroma colors in the order of the 256 L (grey) values.
Now we can combine the new Chroma table with the Luminance of A to construct D.

How do we handle different Chroma values with the same Luminance?
Do we average them?

The question is which color space model do we use for this?

My first thought was, because we only deal with a maximum of 256 different pixel colors for the Destination bitmap,
why not calculate the missing Chroma colors by interpolating between the neighbour Chroma colors, using RGB2YUV <-> YUV2RGB....... ( we can do this fast in SIMD with Dot-Matrix calculations )

If we are lucky, all the Luminance values are present in the Reference Bitmap and we don't need any missing color space conversion calculations, only averaging?

The CIE L*C*h Color Space is a very costly and heavy calculation:

a and b are the polar coordinates:
Chroma = sqrt(a*a+b*b)
hue angle = atan(b/a)

convert back from polar to cartesian coordinates:
a = Chroma*cos(hue angle)
b = Chroma*sin(hue angle)

We could calculate it on the fly and use Chebyshev polynomials to do the trigonometry calculations....
Maybe we can simplify the trigonometry by using precalculated lookup tables?

If we look at the CIE color space image below, we could precalculate the outer rim color values.
Interpolating between the 4 colors red, yellow, green, blue and red, we will end up with 1024 Chroma values which hold all the Chroma values we need.
So we have 256 (Luma) * 1024 (Chroma) = 262144 colors to choose from to construct the 256 color Destination Bitmap Image.

(https://sensing.konicaminolta.us/uploads/ciecolorspace-29309si549.jpg)

The fastest way I think would be approximate the coefficients for a Dot-Matrix conversion, vice versa.

This weekend I will try to put something together if time permits.
Title: Re: Fast Compare Real8 with SSE
Post by: guga on February 08, 2019, 09:01:37 AM
Hi Siekmanski.

Quote
As far as I understand we are using the L of A and the C from B to construct D.

Then we need to construct a Chroma table from B and calculate the missing Chroma colors to complete the total of 256 Chroma colors in the order of the 256 L (grey) values.
Now we can combine the new Chroma table with the Luminance of A to construct D.

How do we handle different Chroma values with the same Luminance?
Do we average them?

Yes, we are calculating the luminosity from the gray values and comparing it from the colored image.

STEP1 - Building the necessary tables of samples.

I built 2 tables. One to be used as a map pointer and other for the colored pixels themselves.

The 1st step is map every pixel and convert them to gray (that can used as an index, for example) luminosity, chroma and hue. I`m showing you here the very basic method of color transfer. I just extended it a bit, to allow other methods to be implemented later, ok ?

It is a array of structures on this format:

[Pixmap.PixAddress: D$ 0 ; a pointer to the address of the color pixel . Currently not used, but it may be necessary for other color methods, such as search for the neighbors pixels
 Pixmap.Gray: D$ 0 ; The gray value of that pixel. For example, RGB = 200, 212, 100, can result in gray of 150 (It´s just, an example. The correct value may be another). So we convert the color pix to gray and put it´s value here
 PixMap.Luma: R$ 0 ; A Real 8 value containing the luminosity of the colored pixel.
 PixMap.Chroma: R$ 0 ; A Real8 value containing the chromacity of the colored pixel
 PixMap.Hue: R$ 0 ; A Real8 value containing the Hue of the colored pixel
]

Size of PixMap = 32  bytes (In fact it is bigger, since it contains some more members in the current structure i updated, but i´m showing the basics, since the other members i´ll may remove those later while developing)

This structure forms a sample, from where we use to scan for a best match for it´s luminosity.

So, if the image has a size of 640*480, there are 307.200 pixels available as a potencial sample. Thus, we need a array of structures of 9830400 bytes long. (640*480*32 bytes)

I then sort this array of structures in crescent order of it´s luminance. (Btw...the each gray value corresponds to a range of luminance, so it will be easier later to find it´s correspondance on the map)

So, the Array after ordering may looks like this:

[Pixmap.Data0.PixAddress: D$ XXXX
 Pixmap.Data0.Gray: D$ 0 ; Once we order by luma, it will always starts with gray = 0 too ;) Luma Range for gray = 0 is >= 0 to < 2.741066938704112e-1.
 PixMap.Data0.Luma: R$ 1.23526e-2
 PixMap.Data0.Chroma: R$ 45.45646 ; In fact, it´s closer to 0 here. It´s just an example ;)
 PixMap.Data0.Hue: R$ 320.124

 Pixmap.Data1.PixAddress: D$ XXXX
 Pixmap.Data1.Gray: D$ 0  ; Again...if the image can produce a gray color whose value is 0, this member will also corresponds to it.
 PixMap.Data1.Luma: R$ 2.1e-1
 PixMap.Data1.Chroma: R$ 55.145
 PixMap.Data1.Hue: R$ 40.1454

 Pixmap.Data2.PixAddress: D$ XXXX
 Pixmap.Data2.Gray: D$ 1  ; Now the colored pixel corresponds to a gray whose value is 1. In other words, if we convert that pixel to gray, it will result in 1. Luma Range for gray = 1  is >= 2.741066938704112e-1 to < 5.48213387740825841e-1
 PixMap.Data2.Luma: R$ 2.741066938704112e-1
 PixMap.Data2.Chroma: R$ 114.125
 PixMap.Data2.Hue: R$ 20.2565

 Pixmap.Data3.PixAddress: D$ XXXX
 Pixmap.Data3.Gray: D$ 1  ; Now the colored pixel corresponds to a gray whose value is 1. In other words, if we convert that pixel to gray, it will result in 1
 PixMap.Data3.Luma: R$ 3.46e-1
 PixMap.Data3.Chroma: R$ 15.125
 PixMap.Data3.Hue: R$ 37.256

 Pixmap.Data4.PixAddress: D$ XXXX
 Pixmap.Data4.Gray: D$ 1  ; Now the colored pixel corresponds to a gray whose value is 1.
 PixMap.Data4.Luma: R$ 5.125e-1
 PixMap.Data4.Chroma: R$ 25.415674
 PixMap.Data4.Hue: R$ 44.689

(...)
; and we continue the array untill the last pixel that should correspond to a gray color of 255 (If it do exists on the colored image, btw).

 Pixmap.Data307.199.PixAddress: D$ XXXX
 Pixmap.Data307.199.Gray: D$ 255  ; Now the colored pixel corresponds to a gray whose value is 255. In other words, if we convert that pixel to gray, it will result in 255
 PixMap.Data307.199.Luma: R$ 100
 PixMap.Data307.199.Chroma: R$ 124.2
 PixMap.Data307.199.Hue: R$ 45.18
]

So, our colored image produced 307200 samples and among them we can have 2 related to gray pixel = 0, 12548 related to gray pixel = 1 and so on. We are basically ordering to count how many gray pixels are generated from each colored one. Since we are using CieLCH colorspace, each generated gray value from the colored pixel will always correspond to a given range of luminosity (also in crescent order), so, when ordering the samples, whenever you order it by lumonsity you won´t have to worry about the ordering of the corresponding gray, because it will always corresponds to a fixed value (From 0 to 255) for a given luma range.

As explained previously, I found out that  luma is directly related to gray, no matter what is the pixel combination. Thus, the 256 gray values always corresponds to 256 chunks/ranges of luma. Like this for sRGB D65:

    Gray  Luminosity (Min)
    0       0
    1       2.741066938704112e-1
    2       5.48213387740825841e-1
    3       8.22320081611237263e-1
    4       1.0964267754816519
    5       1.37053346935206344
    6       1.64464016322247808
    7       1.91874685709288939
(...)
    253     99.309586872082832
    254     99.6549222327689391
    255     100


So, once we finished creating a sample file we can now map it to make the transfer method easier and faster.

To do that, i created a 2nd table containing only 256 structures whose members maps the samples. Each structure is formed by 5 dwords.

[LinkedColoredMap.Data0.GrayColor: D$ 0 ; the gray color index. Always starts with 0 corresponding to gray 0. I know, it maybe unnecessary, sice we can simply poits to the array directly, but i´m using it here for further development
 LinkedColoredMap.Data0.MapPtr: D$ 0 ; The pointer to the start of the sample corresponding to the gray color. in "GrayColor" Member. If there is no corresponding value, this member should be 0
 LinkedColoredMap.Data0.ColorCnt: D$ 0 ; The total amounts of gray of a certain value on the colored image.
 LinkedColoredMap.Data0.NextValidPtr: D$ 0 ; a Pointer to the next address in the sample array corresponding to the next gray value (For example, if the "GrayColor" member here is 100 we will point to the address on the sample map corresponding to gray 101 (If existant).  Also, if the GrayColor member here is 255, then the sample map will have no value at all (after all we have only 256 colors from 0 to 255), thus, this member will be 0 meaning we reached the end of the sample map. Also If the colored image does not produces any gray = 101, then this member will points to the next valid array (In the sample map) corresponding to gray 102 (also, if existant) and so on.
 LinkedColoredMap.Data0.PrevValidPtr: D$ 0 ; A pointer to the previous address in the sample array corresponding to the previous gray value. Similar as above, but backwards. (For example, if the "GrayColor" member here is 100 we will point to the address on the sample map corresponding to gray 99 (If existant).  Also, if the GrayColor member here is 0, then the sample map will have no value at all (after all we reached the start of the sample map, whose values have only 256 colors from 0 to 255), thus, this member will be 0 meaning we reached the start of the sample map and no previous pixels exists. Also If the colored image does not produces any gray = 99, then this member will points to the previous valid array (In the sample map) corresponding to gray 98 (also, if existant) and so on.

The general looks of this array of structure look like this:

[LinkedColoredMap:
 LinkedColoredMap.Data0.GrayColor: D$ 0
 LinkedColoredMap.Data0.MapPtr: D$ Pixmap.Data0.PixAddress ; Address of the start of sample map (colored) whose gray value corresponds to gray = 0. If the colored image don´t have any gray pixel of 0, this member should be 0
 LinkedColoredMap.Data0.ColorCnt: D$ 2 ; Only 2 pixels on the colored image produces a gray value of 0
 LinkedColoredMap.Data0.NextValidPtr: D$ Pixmap.Data2.PixAddress ; The sample image do have pixels that generate gray = 1, thus points to the start of the address in the sample map
 LinkedColoredMap.Data0.PrevValidPtr: D$ 0 ; Since we are at the start of sampple map (Gray = 0), there no pixel before it. So, here the value is 0

 LinkedColoredMap.Data1.GrayColor: D$ 1
 LinkedColoredMap.Data1.MapPtr: D$ Pixmap.Data2.PixAddress
 LinkedColoredMap.Data1.ColorCnt: D$ 12548 ; There are 12548 pixels on that colored image that, after converted to grayscale, corresponds to gray value = 1
 LinkedColoredMap.Data1.NextValidPtr: D$ Pixmap.Data12549.PixAddress ; Start of the next address on the sample map corresponding to gray value = 2 (If existant). If it do not exists it will points to the next address corresponding to gray = 3 and so on
 LinkedColoredMap.Data1.PrevValidPtr: D$ Pixmap.Data0.PixAddress ; Start of the previous address on the sample map corresponding to gray value = 0 (If existant). If it do not exists it will points to the previous address. Since there no pixel before "0", thus this member will be 0, meaning that the colored image when converted to grayscale does not have any 0 gray value.

 LinkedColoredMap.Data2.GrayColor: D$ 2
 LinkedColoredMap.Data2.MapPtr: D$ Pixmap.Data12549.PixAddress
 LinkedColoredMap.Data2.ColorCnt: D$ 30000
 LinkedColoredMap.Data2.NextValidPtr: D$ XXXX
 LinkedColoredMap.Data2.PrevValidPtr: D$ YYYYY

(...)
Until the 256th array
 LinkedColoredMap.Data255.GrayColor: D$ 255
 LinkedColoredMap.Data255.MapPtr: D$ Pixmap.DataZZZZZ.PixAddress
 LinkedColoredMap.Data255.ColorCnt: D$ 4545
 LinkedColoredMap.Data255.NextValidPtr: D$ XXXX
 LinkedColoredMap.Data255.PrevValidPtr: D$ 0 ;  There nothing forward it


STEP 2 - Finding the Best match


In order to find which luminance best matches all is needed is to point the gray value (from the gray image to it´s correspondent member in the LinkedMap structure and start searching inside of it.

For example:

The pixel i want to color in the gray image is the value of "1" (From 0 to 255).

What needs to do is:

1s) Find the proper address in the LinkedColoredMap. On this example it is located at  "LinkedColoredMap.Data1.GrayColor"

On the address of the array we have some necessary info to look for.  We know that, the colored map we do have pixels that after converted generates a value of 1. We have a total amount of 12548 of them :greensml:

But which one to choose ? On this basic method, i´m choosing the pixel that have the value of it´s luminosity (in the colored map related to gray 1 that is closer to the luminosity value of gray 1.

Note: Since gray = 1 have a luminosity range equal or bigger then 2.741066938704112e-1 (Mininum) and smaller then 5.48213387740825841e-1 (Maximum), i´m looking for the match of the minimum gray value. Just for a convention, i could, as well, look for the average, or the maximum etc, but, in fact, for CieLab/CieLCH colorspaces, it really does not matter either the range is minimum or maximum, since they will always represents the very same color that produces that specific gray if they have the same hue. When we are dealing with pixels on the same hue (or very very tiny variances) this difference between maximum and minimum range of luminosity is not enough to make the color be distinguishable. Thus the colors on that range are similar to each other. Also eventual differences in chromacity (for the same hue) are also not enough to make them perceptually distinguishable.  This is why, for convention, mainly, i´m using the minimum value of luminosity

Continuing....if i´m trying to colorize a gray pixel of 1 i search for the valid correspondence in the Sample Map from the Linked Colored one. Thus, i´ll point it to  Pixmap.Data2.PixAddress and then select the very 1st array of that structure.

Why, selecting the 1st and not the others 12547 ? Because on this basic method, i´m selecting the very 1st one whose luminosity bests approaches to gray = 1 . In case, the one that is closer to MinLuma = 2.741066938704112e-1. I´m discarding (On this tests only) all others 12547 so i can later use in another method (pixel neighbour  seraching for average luminance, or looking for the Standard Deviation, or looking for exact pattern match like: looking for the exact position of the neighbor pixels etc).

It will points then to the Pix map address corresponding to gray = 1, that is the address at Pixmap.Data2.PixAddress. Since the PixMap is already ordered, the luma of it at PixMap.Data2.Luma will always be the one that is closer to the one i want to find. In case, both are "2.741066938704112e-1".

Then, i simply grab the values of Chroma and Hue from PixMap.Data2.Chroma and PixMap.Data2.Hue and convert them to color with a CieLCh to RGB function. But...i´m transfering Chroma and Hue and what about Luma ? Which one to choose ? The grayed one or the colored one ? Well...The intention is preserve the luminosity of the gray image and transfer only the Chroma and Hue from the Colored one. So when converting back from CieLCH to RGB, i use the luma related to the gray pixel (in the gray image) and uses the Chroma and Hue from the colored image.

2) What if the colored image don´t have the same gray value as the one we are trying to colorize ?

We then look for the next gray correspondence and search for the one whose luminosity best approaches. That´s why i created the "nextvalidptr" and "prevvalidptr" members. For example, if i´m trying to colorize pixel 1 in the gray image, but, on the colored image there no such a thing as a gray pixel of 1, i start looking for the previous and next valid gray colors.

On this situations, i start looking for NextValidPtr (A pointer to the color map as explained) and store the value of it´s gray value (and luminosity) and do the same for the previous one. Then i calculate a difference among them. The one that have the smaller difference is the best candidate to colorize.

Ex:

GrayPixel = 5 , but on colored image the nearest correspondent gray is 1 (From previous pointer to pixmap) and 6 (from next point to pixmap)

So i do:

Delta (In absolute value) = Gray - prev = 5-1 = 4
Delta (In absolute value) = Gray - next = 5-6 = 1

Ok, so i know that the value on the colored map that most approaches to the gray one is the "next Ptr" whose gray correspondent value is 6. Why ? because "6" is closer to "5" than "1" is closer to "5".  What happens when both deltas are the same ? Well..on this basic method, it choose any of them ;)

Now i simply point to the address at Pixmap corresponding to Gray = 6 and grab it´s hue and chroma and do the transfer on the same way as explained previously.

So, to convert from gray to color i´ll use the Luminosity of my Gray Pixel (in case: 5 whose Minimum Luma = 1.37053346935206344) and use the values of Chroma and Hue from  PixMap.Data6.Chroma/PixMap.Data6.Hue ("PixMap.data6" is an example, the address are whatever the ones corresponding to the start on the pixmap from a gray value = 6)


Quote
The question is which color space model do we use for this?

CieLCh is the one that best produces an accurate result. The correspondence between luminance and gray are more accurate in CieLCH/CieLab colorspaces then YUV, HSL etc etc

I´m not using Cielab because i found out that CieLab produces inaccurate results on the conversion back to RGB, that´s why i´m trying to fix the problems on CieLCH 1st. Also we can use Chroma and Hue to a more accurated method of color transfer as, for example, when we are looking for the neighbor pixels.

With CieLCH we can reduce substantially the total amount of samples to be used, since we can simply discard all samples that are the result of the analysis of the very same color. (Even if the RGB values are different, they may represent the same color.) It´s the difference between using 307.200 samples and using only 1000 of them.

Also, with CieLCH colorspace we can produce a image more accurate. Since we have the ability to remove all duplicated colors we can simply create new ones from their neighbors if needed. This maybe helpful to enhance the general quality of a image or even for tracking techniques in video processing. For tracking, pattern recognition, motion estimation etc, once we remove duplicated colors it will be easier to search for the similar ones on he next frame of a video, for example.

Quote
We could calculate it on the fly and use Chebyshev polynomials to do the trigonometry calculations....
Maybe we can simplify the trigonometry by using precalculated lookup tables?

Yes, this is what i did with the KfactorMap as explained on the pdf. The computation of almost 60% of the code is done before the main RGB to CieLCh and CieLCH to RGB starts. It builds the necessary look up tables to be used. The only thing i did not succeeded is to optimize the sin/cos/atan functions, but  it can be done with the Chebyshev algorithms you did (It is needed a fast a atan too using that method if possible ;) ). This would increase the speed a lot. Using the LUTs i already had a major advantage in terms of speed since i removed completely the math involving the power/logarithm functions and used a simple search on the LUT as provided by JJ´s (Which is amazing fast, btw. Sure, it do have one minor issue yet, but he is working it :)  )

As i mentioned, in my preliminary tests of colorizing with the "normal" method people uses, the rendering was taking more then half an hour to colorize one single image (960*720) and now colorizes almost immediately (it takes 1 or 2 seconds already counting the time it takes to compute the tables etc - which can be done when the app starts too or, even better, making the algorithm loads external files related to samples that were previously created by the user). Using external files may speed up a lot too, since all the algorithm will do is point to the necessary address in the loaded file and voilá :greensml:

Quote
If we look at the CIE color space image below, we could precalculate the outer rim color values.
Interpolating between the 4 colors red, yellow, green, blue and red, we will end up with 1024 Chroma values which hold all the Chroma values we need.
So we have 256 (Luma) * 1024 (Chroma) = 262144 colors to choose from to construct the 256 color Destination Bitmap Image.

Maybe, but the problem of CIE colorspace (CieLab or CieLCH) is that they do produces inaccurate result on the backwards computation (CieLCh to RGB). If you feed the function with your own Luma or new Chroma/Hue the resultant values of RGB will not be accurate since they will be clipped inevitably. That´s why i´m trying to figure it out a way to know the limits or relation in between Hue/Chroma and Luma to prevent such clippings. So far, i guess i was correct in finding one of the relations, that is:

Chroma  <  [(1000/116)*(Luma+16)] / [5*cos(Hue)-2*sin(Hue)]

But...i´m not sure yet. It seems that for each Luma range (That also corresponds to the ranges of gray as i explained) we can have a same range for Chroma (or even Hue). On this way it would be better to find their limits. But i´m still clueless on how to fix that. There no single paper describing how to fix that. They all uses the CiE papers as if they were absolutely correct (which is not the case btw as demonstrated by bruce lindbloom at http://www.brucelindbloom.com/index.html?LContinuity.html)
Title: Re: Fast Compare Real8 with SSE
Post by: guga on February 09, 2019, 03:07:02 AM
Hi Guys....

I maybe have found the limits for the CieLCH colorspace. If i did the maths correctly, the limits for Hue, Chroma and Luma are the following:

LumaMin = 0
LumaMax = 100

ChromaMin = 0
ChromaMax  = 185.6953381770518631465762238462182605619006952291306495751858162074705 (This is the maximum allowed value. The true computed limit for sRGB are a bit smaller, but within their own equations)


Since we have a Chroma and Luma Maximum Hue was supposed to have a maximum value too.
HueMax = 338.1985905136481882297551339130563354323346227336940419663º

This value is coincident with the limit for Gray. Whenever a pixel is gray it do have 2 hues of:
Hue when gray = 158.198590513648188229755133913056335432334622733694041966°
Hue when gray = 338.1985905136481882297551339130563354323346227336940419663º

Also, we have impossible values of Hue, when the formula divides by zero,. Those 3 angles should never be used:
68.1985905136481882297551339130563354323346227336940419663° (degrees) (Orange)
248.1985905136481882297551339130563354323346227336940419662° (degrees) (Blue)

Also, from the limits i found that Hue was supposed to cover all angles (Except those 2 ones that should never be used), but it do have gaps and extrapolations

When we are dealing with Maximum Chroma and Luma, Hue, was supposed to be limited to a 338.1985905136481882297551339130563354323346227336940419663º (The maximum degree where is related to gray) but, in fact, we do have pixels with a Hue of 359º 348º due to a failure on the general formula used for CieLab/CieLCh colorspaces. This fail causes the equations to clip the resultant values of RGB, which, in any extent, implies to conclude that we are retrieving other colors while trying to convert back.


I´ll see how to handle those limits for Hue properly after analysing the returning values of all 16 million colors. Most likely it seems that if i limit the hue angle to a max of 338º and min of 68º i may be able to recover the colors in between this gap. Or simply, will allow at the end the convertion of this gap, simply recomputing their croma and luma to avoid clippings. I´m thinking in what should be the best strategy to handle those colors whose Hue angles should never be there, in the 1st place
Title: Re: Fast Compare Real8 with SSE
Post by: daydreamer on February 09, 2019, 05:06:10 AM
looking forward to what you make Siekmanski

I have decided to work on these kinds of macros and also make ***PD versions of them
http://masm32.com/board/index.php?topic=6802.0 (http://masm32.com/board/index.php?topic=6802.0)
maybe also need to make inttofloat and floattoint,maybe some MMX/SSE2 integer make
maybe it would be good to have some imageoriented macros too?,for example load and store rectangle pixels macros?
any suggestions of useful macros are welcome
maybe could help speed up colorization

Title: Re: Fast Compare Real8 with SSE
Post by: guga on February 09, 2019, 06:36:01 AM
Macros for SSE are needed, indeed :t :t :t

Specially some of them related to conditional macros as "If", "Else", "EndIf", "Loop", "While" etc.  Did you have some macros example using SSE for conditional jmps ?
Title: Re: Fast Compare Real8 with SSE
Post by: guga on February 12, 2019, 05:47:47 AM
Hi Jochen

This one is for 64 bIt. Do you have some info about 80 Bit - real10 ?

The bugs in mine version are probably because i forgot to include a routine to check for  numbers outside the range of the table
No, that's not a problem since we only feed valid numbers. The check can be omitted.

Quote
So, it could compare the index value found at eax, with the LOWORD of the HIDWORD of the next value (or previous one). If it also matches, then we have found the proper values.
Yes, something like that. But in most cases it's not necessary, since the numbers will be "distant" enough anyway. We are checking the high DWORD, i.e. the one that contains the exponent and part of the mantissa:
(https://upload.wikimedia.org/wikipedia/commons/thumb/a/a9/IEEE_754_Double_Floating_Point_Format.svg/618px-IEEE_754_Double_Floating_Point_Format.svg.png)
That "part" is 32-11-1=20 bits, or about 6-7 digits. Should be sufficient in most cases ;-)


I´m rewriting an old code to better displays the category of all FPU errors (when existent) for Real10. It was ok, so far, but probably there is a little bug when it tries to identify values such as:
[Value1: D$ 0, 0FFFFFFFF W$ 0] ; This number exceeds the limit for FPU. It is converted to 6.724206...e-4932 Does it exceeds the limit ?
[Value2: B$ 0FE, 07F, 0, 0, 0C0, 07F, 0, 0, 0, 0] ; This number exceeds the limit for FPU. It is converted to 5.1201424......e-4937 Does it exceeds the limit ?

Are those values/limit  correct ?

Code: [Select]

[SpecialFPU_QNAN 1] ; QNAN
[SpecialFPU_SNAN 2] ; SNAN
[SpecialFPU_NegInf 3] ; Negative Infinite
[SpecialFPU_PosInf 4] ; Positive Infinite
[SpecialFPU_Indefinite 5] ; Indefinite
[SpecialFPU_SpecialIndefQNan 6] ; Special INDEFINITE QNAN
[SpecialFPU_SpecialIndefSNan 7] ; Special INDEFINITE SNAN
[SpecialFPU_SpecialIndefInfinite 8] ; Special INDEFINITE Infinite

Proc RealTenFPUNumberCategory:
    Arguments @Float80Pointer
    Local @FPUErrorMode
    Uses edi, ebx


    mov ebx D@Float80Pointer
    mov D@FPUErrorMode &FALSE

    ...If_And W$ebx+8 = 0, D$ebx+4 = 0 ; This is denormalized, but it is possible.
        ; 0000 00000000 00000000
        ; 0000 00000000 FFFFFFFF
    ...Else_If_And W$ebx+8 = 0, D$ebx+4 > 0 ; This is Ok.
        ; 0000 00000001 00000000
        ; 0000 FFFFFFFF FFFFFFFF
    ...Else_If_And W$ebx+8 > 0, W$ebx+8 < 07FFF; This is ok only if the fraction Dword is bigger or equal to 080000000
        .If D$ebx+4 < 080000000
            .Test_If D$ebx+4 040000000
                ; QNAN 40000000
                mov D@FPUErrorMode SpecialFPU_QNAN
            .Test_Else
                If_And D$ebx+4 > 0, D$ebx > 0
                    ; SNAN only if at least 1 bit is set
                    mov D@FPUErrorMode SpecialFPU_SNAN
                Else ; All fraction Bits are 0
                    ; Bit 15 is never reached. The bit is 0 from W$ebx+8
                    ; -INFINITE ; Bit15 = 0
                    mov D@FPUErrorMode SpecialFPU_NegInf
                End_If
            .Test_End
        .End_If
    ...Else_If W$ebx+8 = 07FFF; This is ok only if the fraction Dword is bigger or equal to 080000000
        .Test_If D$ebx+4 040000000
            ; QNAN 40000000
            mov D@FPUErrorMode SpecialFPU_QNAN
        .Test_Else
            If_And D$ebx+4 > 0, D$ebx > 0
                ; SNAN only if at least 1 bit is set
                mov D@FPUErrorMode SpecialFPU_SNAN
            Else ; All fraction Bits are 0
                ; Bit 15 is never reached. The bit is 0 from W$ebx+8
                ; -INFINITE ; Bit15 = 0
;               Test_If W$ebx+8 = 0FFFF ; we need to see if Bit 15 is set
 ;                  ; -INFINITE ; Bit15 = 0
  ;             Test_Else
   ;                ; +INFINITE ; Bit15 = 1
    ;           Test_End
                ;mov D$edi '-INF', B$edi+4 0
                mov D@FPUErrorMode SpecialFPU_NegInf
            End_If
        .Test_End
        ; Below is similar to W$ebx+8 = 0
    ...Else_If_And W$ebx+8 = 08000, D$ebx+4 = 0 ; This is denormalized, but possible.
        ; 8000 00000000 00000000 (0)
        ; 8000 00000000 FFFFFFFF (-0.0000000156560127730E-4933)
    ...Else_If_And W$ebx+8 = 08000, D$ebx+4 > 0 ; This is Ok.
        ; 8000 01000000 00000000 (-0.2626643080556322880E-4933)
        ; 8000 FFFFFFFF 00000001 (-6.7242062846585856000E-4932)
    ...Else_If_And W$ebx+8 > 08000, W$ebx+8 < 0FFFF; This is ok only if the fraction Dword is bigger or equal to 080000000
        .If D$ebx+4 < 080000000
            .Test_If D$ebx+4 040000000
                ; QNAN 40000000
                mov D@FPUErrorMode SpecialFPU_QNAN
            .Test_Else
                If_And D$ebx+4 > 0, D$ebx > 0
                    ; SNAN only if at least 1 bit is set
                    ;mov D$edi 'SNaN', B$edi+4 0
                    mov D@FPUErrorMode SpecialFPU_SNAN
                Else ; All fraction Bits are 0
                    ; Bit 15 is always reached. The bit is 1 from W$ebx+8
                    ; +INFINITE ; Bit15 = 1
                    ;mov D$edi '+INF', B$edi+4 0
                    mov D@FPUErrorMode SpecialFPU_PosInf
                End_If
            .Test_End
        .End_If

    ...Else_If W$ebx+8 = 0FFFF; This is to we identify indefined or other NAN values

        .If_And D$ebx+4 >= 040000000, D$ebx = 0
            ; INDEFINITE
            mov D@FPUErrorMode SpecialFPU_Indefinite
        .Else
            .Test_If D$ebx+4 040000000
                ; Special INDEFINITE QNAN 40000000
                mov D@FPUErrorMode SpecialFPU_SpecialIndefQNan
            .Test_Else
                If_And D$ebx+4 > 0, D$ebx > 0
                    ; Special INDEFINITE SNAN only if at least 1 bit is set
                    mov D@FPUErrorMode SpecialFPU_SpecialIndefSNan
                Else ; All fraction Bits are 0
                    ; Bit 15 is always reached. The bit is 1 from W$ebx+8
                    ; Special INDEFINITE +INFINITE ; Bit15 = 1
                    mov D@FPUErrorMode SpecialFPU_SpecialIndefInfinite
                End_If
            .Test_End
        .End_If
    ...End_If

    ..If D@FPUErrorMode <> 0

        On B$edi-1 = '-', dec edi

        .If D@FPUErrorMode = SpecialFPU_QNAN
            push esi | zcopy {"QNAN ", 0} | pop esi
            mov B$edi 0
        .Else_If D@FPUErrorMode = SpecialFPU_SNAN
            push esi | zcopy {"SNAN ", 0} | pop esi
            mov B$edi 0
        .Else_If D@FPUErrorMode = SpecialFPU_NegInf
            push esi | zcopy {"-INFINITE ", 0} | pop esi
            mov B$edi 0
        .Else_If D@FPUErrorMode = SpecialFPU_PosInf
            push esi | zcopy {"+INFINITE ", 0} | pop esi
            mov B$edi 0
        .Else_If D@FPUErrorMode = SpecialFPU_Indefinite
            push esi | zcopy {"INDEFINITE ", 0} | pop esi
            mov B$edi 0
        .Else_If D@FPUErrorMode = SpecialFPU_SpecialIndefQNan
            push esi | zcopy {"Special INDEFINITE QNAN ", 0} | pop esi
            mov B$edi 0
        .Else_If D@FPUErrorMode = SpecialFPU_SpecialIndefSNan
            push esi | zcopy {"Special INDEFINITE SNAN ", 0} | pop esi
            mov B$edi 0
        .Else_If D@FPUErrorMode = SpecialFPU_SpecialIndefInfinite
            push esi | zcopy {"Special INDEFINITE +INFINITE ", 0} | pop esi
            mov B$edi 0
        .End_If

    ..End_If

    mov eax D@FPUErrorMode

EndP
Title: Re: Fast Compare Real8 with SSE
Post by: TimoVJL on February 12, 2019, 06:15:11 AM
https://en.wikipedia.org/wiki/Extended_precision
(https://upload.wikimedia.org/wikipedia/commons/thumb/e/e1/X86_Extended_Floating_Point_Format.svg/762px-X86_Extended_Floating_Point_Format.svg.png)
Title: Re: Fast Compare Real8 with SSE
Post by: guga on February 12, 2019, 06:24:23 AM
Tks, TimoJVl

I read this, but didn´t fully understood

It shows the image as an 80 bit value, but the sign Bit is marked as "Bit 63" , but Bit 63 is for 64 Bit vfalues (Real4). The signed bit in Real10 is at bit 79.

I´m trying to find a similar scheme showing the Pseudo-Infinit etc etc, but for 80 bit.
Title: Re: Fast Compare Real8 with SSE
Post by: jj2007 on February 12, 2019, 06:49:24 AM
This one is for 64 bIt. Do you have some info about 80 Bit - real10 ?

https://en.wikipedia.org/wiki/Extended_precision:
(https://upload.wikimedia.org/wikipedia/commons/thumb/e/e1/X86_Extended_Floating_Point_Format.svg/762px-X86_Extended_Floating_Point_Format.svg.png)

Note that in contrast to single and double, the REAL10 does have an explicit integer bit (more (http://www.website.masmforum.com/tutorials/fptute/fpuchap2.htm)).
Title: Re: Fast Compare Real8 with SSE
Post by: guga on February 12, 2019, 12:19:06 PM
Tks again, JJ

I read it both, but one thing i´m still not understanding. Raymond says that the limit for Real10 is from 3.36x10-4932 whereas wiki says it is  3.65×10−4951. So, what is the real limit ? I mean, if e-4951 is the true limit, then i presume my function was correct in convert them but, if not, i´m clueless on what i could be doing wrong.
Title: Re: Fast Compare Real8 with SSE
Post by: HSE on February 12, 2019, 01:09:08 PM
Just to contributed to confusion, that depends if you are talking about signed or unsigned REAL10.
Title: Re: Fast Compare Real8 with SSE
Post by: guga on February 12, 2019, 01:56:16 PM
Oh my God. :dazzled: :dazzled: :dazzled: Holy Shmoly Batman :icon_mrgreen: :icon_mrgreen: :icon_mrgreen: I give up ! I´ll open a bottle of beer and don´t stop drinking until next Monday :icon_mrgreen: :icon_mrgreen: :icon_mrgreen: Errr....not the same beer :icon_rolleyes: :icon_rolleyes: :icon_rolleyes: :bgrin: :bgrin: :bgrin: :bgrin:
Title: Re: Fast Compare Real8 with SSE
Post by: jj2007 on February 12, 2019, 07:09:09 PM
Raymond says that the limit for Real10 is from 3.36x10-4932 whereas wiki says it is  3.65×10−4951. So, what is the real limit ?

Wiki says (https://en.wikipedia.org/wiki/Extended_precision#Working_range) The 80-bit floating point format has a range (including subnormals) from approximately 3.65×10−4951 to 1.18×104932. Indeed an interesting difference 8)
Title: Re: Fast Compare Real8 with SSE
Post by: HSE on February 12, 2019, 09:37:34 PM
Intel manual (downloaded minutes ago) say that Double Extended Precision (80 bits) is signed and range is 3.37 × 10^–4932 to 1.18 × 10^4932 
Title: Re: Fast Compare Real8 with SSE
Post by: Siekmanski on February 12, 2019, 11:05:21 PM
Interesting info on the IEEE Standard for Floating-Point Arithmetic IEEE 754-2008 revision
http://www.dsc.ufcg.edu.br/~cnum/modulos/Modulo2/IEEE754_2008.pdf

https://steve.hollasch.net/cgindex/coding/ieeefloat.html
Title: Re: Fast Compare Real8 with SSE
Post by: jj2007 on February 12, 2019, 11:38:50 PM
Intel manual (downloaded minutes ago) say that Double Extended Precision (80 bits) is signed and range is 3.37 × 10^–4932 to 1.18 × 10^4932

Wiki says (https://en.wikipedia.org/wiki/Extended_precision#Working_range) The 80-bit floating point format has a range (including subnormals) from approximately 3.65×10−4951 to 1.18×104932.

The 3.37 × 10^–4932 are normalised numbers. Below that you have approx 2^63 subnormal numbers with exponent all zero and decreasing precision.
Title: Re: Fast Compare Real8 with SSE
Post by: guga on February 13, 2019, 12:18:07 AM
Too confusing.  :dazzled: :dazzled:

The problem is that the generated values of such things as "0FE, 07F, 0, 0, 0C0, 07F, 0, 0, 0, 0" do generates a valid number that exceeds IEEE (since this, in particular is subnormal). The function i made can actually convert such things on the same way ollydebugger does (except, on mine, i didn´t threw any msg of bad number (subnormal) as in olly. I achieved the same result as in Olly. (5.12...e-4937) but still not convinced what it is all about. If this (Subnormal numbers) should be considered a NAN or not. And if it is, then what are the bits that could be settled to configure this as a NAN since the bits responsible for identify NAN´s simply does not identify this  ? And, btw, the FPU operators seems to accept this number without throwing any exception .

HSE, i read Intel manual, but i couldn´t found on them anything about those subnormal ranges. JJ is correct about the lack of precision and the limits, but i don´t know how (or if) those numbers can be categorized as NANs for usage in a disassembler or debugger, for example. I realize that such values shouldn´t be a big beef (afterall who on Earth will use values such as 1e-4937 ?) but, for a disassembler point of view, it maybe necessary to know what to do in order to avoid cascading errors. For example, if we have a table of bytes (or structures) that contains a mix of FloatingPoints (Real10) and single bytes, if those numbers were considered valid, the whole strucure will be converted to Float units. However, if they are really to be considered NAN´s, then the disassembler will simply discard those bytes as true numbers and display the corresponding bytes rather then the float, and this bytes may also be part of pointer to addresses, for example. So, to avoid cascading errors, it is reasonable think whether those subnormal may represent valid Floats (although imprecise) or not

I start wondering if posit is really better then IEEE floats.

https://supercomputingfrontiers.eu/2017/wp-content/uploads/2017/03/2_1100_John-Gustafson.pdf
https://github.com/libcg/bfp
https://posithub.org/docs/PositTutorial_Part1.html

Maybe it´s worth to give a try later ?
Title: Re: Fast Compare Real8 with SSE
Post by: HSE on February 13, 2019, 12:48:09 AM
Wiki says (https://en.wikipedia.org/wiki/Extended_precision#Working_range) The 80-bit floating point format has a range (including subnormals) from approximately 3.65×10−4951 to 1.18×104932.
Working with exceptions then is 3.65×10−4951 to 1.18×104951, because also there are denormalized positive numbers
Title: Re: Fast Compare Real8 with SSE
Post by: jj2007 on February 13, 2019, 12:59:10 AM
With Olly, a Num dd 0,  80000000h, 1h is the limit.
Title: Re: Fast Compare Real8 with SSE
Post by: guga on February 13, 2019, 02:05:10 AM
With Olly, a Num dd 0,  80000000h, 1h is the limit.

I´ll have to do the same thing then and insert a flag of bad number for the debugger (displaying a msg as in ollydbg) and on the disassembler as well. Damn ! I wonder why why why people uses such things or why compilers can´t simply adjust their precision ? I´ll have to plan a whole different strategy now for the disassembler since compilers can generate such bad numbers and see how can i properly identify them as a number (although bad ones) or a chunk of bytes that can be used as pointers or whatever.

I was giving a test on such numbers using old apps i have since the late 90´s. I found those numbers while disassembling an old version of Mirc that used a library containing functions such as: logl, f87_Log, matherrl that contained such numbers displayed on a Array of Real10 Floats. It was compiled using BCC 4/5 :greensml: :greensml:
Title: Re: Fast Compare Real8 with SSE
Post by: guga on February 13, 2019, 02:48:21 AM
Interesting info on the IEEE Standard for Floating-Point Arithmetic IEEE 754-2008 revision
http://www.dsc.ufcg.edu.br/~cnum/modulos/Modulo2/IEEE754_2008.pdf

https://steve.hollasch.net/cgindex/coding/ieeefloat.html

Tks a lot, Siekmanski. It will help to properly categorize those situations  :t :t :t :t
Title: Re: Fast Compare Real8 with SSE
Post by: guga on February 13, 2019, 03:28:21 AM
Ok, so according to the documents and information, it´s safe to conclude that whenever the last Word of a TenByte is 0, the number is subnormal and have his range outside the maximum values, right ?
Title: Re: Fast Compare Real8 with SSE
Post by: guga on February 13, 2019, 08:19:34 AM
If you guys are interested, here is a nice article

https://www.codeproject.com/Articles/25294/Avoiding-Overflow-Underflow-and-Loss-of-Precision
Title: Re: Fast Compare Real8 with SSE
Post by: Siekmanski on February 15, 2019, 06:59:57 AM
Hi guga,
The color space routine is finished and I hope it is as should be? (not 100% sure, I don't have reference bitmaps to check against)
Have to write the stuff to combine the source-luma and the reference-chromas to get the destination bitmap.
Next week I'll write the rest.

(http://members.home.nl/siekmanski/ColorSpace.png)
Title: Re: Fast Compare Real8 with SSE
Post by: daydreamer on February 16, 2019, 05:50:13 AM
Good work Siekmanski  :t
Title: Re: Fast Compare Real8 with SSE
Post by: Siekmanski on February 16, 2019, 06:18:43 AM
Thanks Magnus, hope I didn't screw up the calculations....  :biggrin:
Title: Re: Fast Compare Real8 with SSE
Post by: guga on February 16, 2019, 04:00:58 PM
Wonderfull, Siekmanski. :t :t :t


I finished some calculations about the CieLCH routines. CieLCH/CieLab are way better then YUV colorspace, but still have some issues on the backwards computation. I succeeded to find the limits for Chroma And Hue on the RGB to CieLCH convertion. So, whenever you need to readjust luma or chorma to a new RGB it won´t mess up the result.

If you need i can upload the updated pdf for you. One thing i found is that the hue from CieLCH is supposed to have a limit of sqrt(29) According to this:

Chroma < [(1000/116)*(Luma+16)] / [5*cos(Hue)-2*sin(Hue)]

And for the maximum value for Hue, it must obey this limit: 5*cos(Hue)-2*sin(Hue) <= sqrt(29)

The only problem about those limitations of CieLab/CieLCH is that Hue was to be restricted to an angle of something around 338º but, it is extrapolating in 1 quadrant (So, plus 90º). One solution is reduce Hue forcing it to be restricted to 0 to 270º

The problem on CieLab relies on this:

Luma = YFinal*116-16
a = (XFinal-YFinal)*500 = AFactor
b = (YFinal-ZFinal)*200 = BFactor

The bFactor is extrapolating the Hue. I`m quite sure, the ratio is 3/4. So, perhaps, multiplying the final "b' with 3/4 could fix the Chroma and also the hue forcing it to stay in the limit of only 3 quadrants


To convert back from CieLCH to RGb you must 1st see if the relation between Hue, Chroma and Hue fits according to this formula:

Chroma < [(1000/116)*(Luma+16)] / [5*cos(Hue)-2*sin(Hue)]

But it probably needs to reduce Chroma (reducing "b' with 3/4) to make the hue angle stay on the proper limits.

Tomorrow i´ll try making some more calculations  to check if it is correct those limits on the backward convertion. Raymond, JJ and AW helped me with a problem i was having on the FPU routines i use to analyse all of this :)
Title: Re: Fast Compare Real8 with SSE
Post by: Siekmanski on February 16, 2019, 07:47:59 PM
CieLCH seems to be the best option but also the most complicated color space conversion algorithm.
For me this is a new field to explore and have to learn a lot more to understand it fully.
Looking forward to see your CieLCH routine when finished.  :t

Found this paper: http://jcgt.org/published/0002/02/01/paper.pdf

In my own logic, I always try to understand algorithms by working my way backwards.
Then I try to simplify the calculations if possible.
In the case of color space conversion you can precalculate the coefficients for a dot3 matrix routine.

Once you know the 3*3 matrix coefficients for the "forward" RGB -> XYZ calculations,
you can use the inverse of the 3*3 matrix coefficients to compute the "backwards" XYZ -> RGB.
This way, the "backwards" results will always be correct.

Bruce Lindbloom has done some of the coefficients math for us to compute the RGB -> XYZ and XYZ -> RGB matrices.
http://www.brucelindbloom.com/Eqn_RGB_XYZ_Matrix.html

There is a CIE Color Calculator on his site:
http://www.brucelindbloom.com/ColorCalculator.html

My routines are not finished yet but I will post them when ready, then we can check if it is done right.

Here is the dot3 color conversion routine I wrote, the one is used in the example above in reply #70.
I wrote it in SSE2 instructions so it can be used on older computers as well. ( so no fancy byte shuffles in this one. )
Adjusted the 3*3 matrix transpose routine to handle 4 row elements preserving alpha

Code: [Select]
                ;    B          G          R          A
CIERGB2XYZ  real4  0.2006017, 0.3106803, 0.4887180, 0.0 ; X
            real4  0.0108109, 0.8129847, 0.1762044, 0.0 ; Y
            real4  0.9897952, 0.0102048, 0.0000000, 1.0 ; Z ( AZ must be 1.0 )

ALPHA_mask  dd  -1,-1,-1,0


align 4
ColorConversionInt2Float proc uses ebx esi edi BitmapWidth:DWORD,BitmapHeight:DWORD,pSourceMem:DWORD,pDestinationMem:DWORD,pConversionType:DWORD

    mov         esi,pSourceMem
    mov         edi,pDestinationMem
    mov         edx,pConversionType
   
    mov         ecx,BitmapWidth
    imul        ecx,BitmapHeight
    shr         ecx,2
   
    pxor        xmm5,xmm5                   ; Empty the source operand, to zero the integer high parts,
                                            ; in the "punpcklbw", "punpcklwd" instructions
align 16
LoadFourPixels:
    mov         ebx,4
    movdqa      xmm6,oword ptr [esi]        ; Load 4 ARGB pixels at once

FourPixelLP:
    movq        xmm0,xmm6                   ; 1 pixel
    punpcklbw   xmm0,xmm5                   ; Convert 4 bytes to 4 words
    punpcklwd   xmm0,xmm5                   ; Convert 4 words to 4 dwords
    cvtdq2ps    xmm0,xmm0                   ; Convert 4 dwords to 4 real4 values

    movaps      xmm1,xmm0                   ; [B G R A]
    movaps      xmm2,xmm0                   ; [B G R A]
    mulps       xmm0,oword ptr [edx]        ; [BX GX RX  --] Multiply Color X conversion coefficients
    mulps       xmm1,oword ptr [edx+16]     ; [BY GY RY  --] Multiply Color Y conversion coefficients
    mulps       xmm2,oword ptr [edx+32]     ; [BZ GZ RZ  AZ] Multiply Color Z conversion coefficients

    ; Color conversion using an adjusted SSE2 4*3 matrix transposition routine ( preserving Alpha_Z)
    ; Now we can run a fast Dot3 ( 3-component vector ) calculation on the
    ; 12 color components and 12 color coefficients ( 9 of each + 3 Alpha components )
    ; Calculations are in parallel, 3 muls and 2 adds
   
    movaps      xmm3,xmm0                   ; [BX GX RX --]
    movaps      xmm4,xmm2                   ; [BZ GZ RZ AZ]
    unpcklps    xmm3,xmm1                   ; [BX BY GX GY]
    unpcklps    xmm4,xmm4                   ; [BZ BZ GZ GZ]
    movhlps     xmm4,xmm3                   ; [GX GY GZ GZ]
    movlhps     xmm3,xmm2                   ; [BX BY BZ GZ]
    unpckhps    xmm0,xmm1                   ; [RX RY -- --]
    shufps      xmm0,xmm2,Shuffle(3,2,1,0)  ; [RX RY RZ AZ]
    shufps      xmm6,xmm6,Shuffle(0,3,2,1)  ; pre-load next ARGB pixel
    addps       xmm3,xmm4                   ; [BX+GX BY+GY BZ+GZ GZ+GZ]
    andps       xmm3,oword ptr ALPHA_mask   ; [BX+GX BY+GY BZ+GZ --]   
    addps       xmm0,xmm3                   ; [RX+BX+GX RY+BY+GY RZ+BZ+GZ AZ]
                                            ; result: BGRA
    movaps      oword ptr [edi],xmm0        ; Store BGRA Pixel in Real4 format
    add         edi,16
    dec         ebx
    jnz         FourPixelLP
    add         esi,16
    dec         ecx
    jnz         LoadFourPixels
    ret

ColorConversionInt2Float endp
Title: Re: Fast Compare Real8 with SSE
Post by: daydreamer on February 16, 2019, 08:28:49 PM
Guga please post PDF,I dug up and found SSE/SSE2 tutorial and posted it in my SSE thread to keep it from disappearing among lots of posts
I think if you understand the way of packed compares work,you dont need to be stuck with lots of packed SSE code bottlenecked with scalar IF's
I want to make a realtime colorization demo,starting with a simpler colorspace to begin with ,if CieLCH is too slow

Siekmanski great work :t
Title: Re: Fast Compare Real8 with SSE
Post by: guga on February 16, 2019, 11:29:41 PM
CieLCH seems to be the best option but also the most complicated color space conversion algorithm.
For me this is a new field to explore and have to learn a lot more to understand it fully.
Looking forward to see your CieLCH routine when finished.  :t

Found this paper: http://jcgt.org/published/0002/02/01/paper.pdf

In my own logic, I always try to understand algorithms by working my way backwards.
Then I try to simplify the calculations if possible.
In the case of color space conversion you can precalculate the coefficients for a dot3 matrix routine.

Once you know the 3*3 matrix coefficients for the "forward" RGB -> XYZ calculations,
you can use the inverse of the 3*3 matrix coefficients to compute the "backwards" XYZ -> RGB.
This way, the "backwards" results will always be correct.

Bruce Lindbloom has done some of the coefficients math for us to compute the RGB -> XYZ and XYZ -> RGB matrices.
http://www.brucelindbloom.com/Eqn_RGB_XYZ_Matrix.html

There is a CIE Color Calculator on his site:
http://www.brucelindbloom.com/ColorCalculator.html

My routines are not finished yet but I will post them when ready, then we can check if it is done right.

Here is the dot3 color conversion routine I wrote, the one is used in the example above in reply #70.
I wrote it in SSE2 instructions so it can be used on older computers as well. ( so no fancy byte shuffles in this one. )
Adjusted the 3*3 matrix transpose routine to handle 4 row elements preserving alpha

Code: [Select]
                ;    B          G          R          A
CIERGB2XYZ  real4  0.2006017, 0.3106803, 0.4887180, 0.0 ; X
            real4  0.0108109, 0.8129847, 0.1762044, 0.0 ; Y
            real4  0.9897952, 0.0102048, 0.0000000, 1.0 ; Z ( AZ must be 1.0 )

ALPHA_mask  dd  -1,-1,-1,0


align 4
ColorConversionInt2Float proc uses ebx esi edi BitmapWidth:DWORD,BitmapHeight:DWORD,pSourceMem:DWORD,pDestinationMem:DWORD,pConversionType:DWORD

    mov         esi,pSourceMem
    mov         edi,pDestinationMem
    mov         edx,pConversionType
   
    mov         ecx,BitmapWidth
    imul        ecx,BitmapHeight
    shr         ecx,2
   
    pxor        xmm5,xmm5                   ; Empty the source operand, to zero the integer high parts,
                                            ; in the "punpcklbw", "punpcklwd" instructions
align 16
LoadFourPixels:
    mov         ebx,4
    movdqa      xmm6,oword ptr [esi]        ; Load 4 ARGB pixels at once

FourPixelLP:
    movq        xmm0,xmm6                   ; 1 pixel
    punpcklbw   xmm0,xmm5                   ; Convert 4 bytes to 4 words
    punpcklwd   xmm0,xmm5                   ; Convert 4 words to 4 dwords
    cvtdq2ps    xmm0,xmm0                   ; Convert 4 dwords to 4 real4 values

    movaps      xmm1,xmm0                   ; [B G R A]
    movaps      xmm2,xmm0                   ; [B G R A]
    mulps       xmm0,oword ptr [edx]        ; [BX GX RX  --] Multiply Color X conversion coefficients
    mulps       xmm1,oword ptr [edx+16]     ; [BY GY RY  --] Multiply Color Y conversion coefficients
    mulps       xmm2,oword ptr [edx+32]     ; [BZ GZ RZ  AZ] Multiply Color Z conversion coefficients

    ; Color conversion using an adjusted SSE2 4*3 matrix transposition routine ( preserving Alpha_Z)
    ; Now we can run a fast Dot3 ( 3-component vector ) calculation on the
    ; 12 color components and 12 color coefficients ( 9 of each + 3 Alpha components )
    ; Calculations are in parallel, 3 muls and 2 adds
   
    movaps      xmm3,xmm0                   ; [BX GX RX --]
    movaps      xmm4,xmm2                   ; [BZ GZ RZ AZ]
    unpcklps    xmm3,xmm1                   ; [BX BY GX GY]
    unpcklps    xmm4,xmm4                   ; [BZ BZ GZ GZ]
    movhlps     xmm4,xmm3                   ; [GX GY GZ GZ]
    movlhps     xmm3,xmm2                   ; [BX BY BZ GZ]
    unpckhps    xmm0,xmm1                   ; [RX RY -- --]
    shufps      xmm0,xmm2,Shuffle(3,2,1,0)  ; [RX RY RZ AZ]
    shufps      xmm6,xmm6,Shuffle(0,3,2,1)  ; pre-load next ARGB pixel
    addps       xmm3,xmm4                   ; [BX+GX BY+GY BZ+GZ GZ+GZ]
    andps       xmm3,oword ptr ALPHA_mask   ; [BX+GX BY+GY BZ+GZ --]   
    addps       xmm0,xmm3                   ; [RX+BX+GX RY+BY+GY RZ+BZ+GZ AZ]
                                            ; result: BGRA
    movaps      oword ptr [edi],xmm0        ; Store BGRA Pixel in Real4 format
    add         edi,16
    dec         ebx
    jnz         FourPixelLP
    add         esi,16
    dec         ecx
    jnz         LoadFourPixels
    ret

ColorConversionInt2Float endp

Great work :t

About the transposing matrix. Yes. The backwards calculation uses the inverse matrix transposed. All of this is precalculated way before the routines RGBtoCieLch/CieLCHtoRGB functions starts. One thing only, don´t forget to include the gamma adjustments and white reference i told on the paper.


Bruce has done a great job on all of this, but stil the problem on this colorspace in particular remains. He managed to fix the discontinuity problem adjusting the threshold after the RGB is converted to XYZ before converting it to Lab/LCH but, what he didn´t thought is to check the limits between Hue/Chroma and Luminosity that do have some issues if you use the CIE formula without the necessary fixes.

If you use the formula on the "normal"  way you will end up having to clip the resultant R, G or B on the backwards computation. So, when yu are doing the backwards computation you always needs to check if the relation beetween chroma/hue and luma fits to tjhe formula i proposed.

About hue, the delta hue i showed previously [5*cos(Hue)-2*sin(Hue)] seems to be limited to sqrt(29). I´ll finish the paper updating it to include the backward formula and post it here for you see.





Title: Re: Fast Compare Real8 with SSE
Post by: guga on February 16, 2019, 11:46:32 PM
Guga please post PDF,I dug up and found SSE/SSE2 tutorial and posted it in my SSE thread to keep it from disappearing among lots of posts
I think if you understand the way of packed compares work,you dont need to be stuck with lots of packed SSE code bottlenecked with scalar IF's
I want to make a realtime colorization demo,starting with a simpler colorspace to begin with ,if CieLCH is too slow

Siekmanski great work :t

Hi DayDreamer. I´m finishing some details on the algorithm and will update with the backwards algorithm for you and Siekmanski
Title: Re: Fast Compare Real8 with SSE
Post by: Siekmanski on February 16, 2019, 11:54:57 PM
 :t
Title: Re: Fast Compare Real8 with SSE
Post by: daydreamer on February 17, 2019, 02:20:06 AM
Hi DayDreamer. I´m finishing some details on the algorithm and will update with the backwards algorithm for you and Siekmanski
great  :t

Title: Re: Fast Compare Real8 with SSE
Post by: guga on February 18, 2019, 03:01:52 AM
Hi Siekmanski, can you try something ?

I altered the original formula of CieLab to better fit the Chroma and Hue on CieLCH. Can you please give a test replacing this:

Code: [Select]
a = (X-Y)*500
b = (Y-Z)*200

with this.
Code: [Select]
a = (X-Y)*(5/2)*225
b = (Y-Z)*225

It seems that this new formula better fits to the Hue and Chroma and don´t extrapolates the result (I hope)


There is a problematic gap on the original CieLab equation and i´m trying to fix this. If i did the math correctly, the new equation may be the one that corrects that problem. I wold like to see the results using your images as you did previously.

It will increase the chromacity on a factor of 1.125 but at the same time, forces the Hue to be restricted on their own limits as the ones i found from luma.

If this is correcet, then i may have found the correct boundaries/ranges for chromacity too :bgrin: :bgrin:

And then we can assume that we have only 256 different variations of luminance, each one having his ranges (fixed ones) for Chromacity as well allowing a range of Hue to be stayed in 0º to 360º without extrapolating and better spreading the hue angles accordying to each chroma/luma range.

So, the final CieLCH function can be even faster, simply pointing to a table containing the ranges for Luma and Chroma (That also can be used integers as well from 0 to 255, rather then Floating Points) and using only Hue with fractions. (Limited to the chroma/luma table too, i hope  :icon_mrgreen:)

Note:
Using b = (Y-Z)*225 seems compatible with some other perceptual colorspaces and gamma correction as well, since they tend to limit the RGB values to 253. So: 225*1.125 ~ 253

If we want to keep the range of 0 to 255, probably we could try also with this other formula too:


Code: [Select]
a = (X-Y)*(5/2)*(226+2/3)
b = (Y-Z)*(226+2/3)
since: (226+2/3)*1.125 = 255 ;) This last  one seems more logical using all range of 0 to 255, since we already corrected gamma and white reference previously. So, it seems to me unreasonable limiting the range to 253 as it is done on the 'normal' way.
Title: Re: Fast Compare Real8 with SSE
Post by: Siekmanski on February 18, 2019, 10:37:08 AM
Hi guga,

Still have to do more research before I can understand the CieLCH color conversion.

I think you are right, it is logical to use the complete range of 256 lumas or at least the maximum value of the lumas in the source bitmap.

In your case, where you want to combine the luma of bitmap A with the chromas of bitmap B to construct a new bitmap from bitmap A with new colors we only need to create a 256 colors palette.
Correct me if I'm wrong.

Not every bitmap has the full range of 256 luma values and most probably, some luma values that are present in bitmap A aren't present in bitmap A and vice versa.

So, I think in my own logic ( I have no experience in this matter and could be totally missing the point here  :icon_eek:) that we need the highest luma value from bitmap A as the total number of palette entries.

Then calculate the avarage ( maybe it is better to get the standard deviation or variance.... ) of each chroma pair with the same luma.
If there are missing luma values in bitmap B, we can interpolate between the neighbour chromas and assign the value(s) to the missing lumas.

Then there is another thing.
What do we do when there are less luma values in bitmap B as in bitmap A after we have calculated the missing in between lumas in bitmap B?
I think we can simply stretch the chroma values to the number of lumas in bitmap A.

These are just my thoughts and probably not the best way of doing this but, I will try it out.
Title: Re: Fast Compare Real8 with SSE
Post by: guga on February 18, 2019, 01:33:15 PM
Hi SiekManski

Calculating the average or standard deviation is one way to do it too when the source bitmap has no luma as in the gray image.

What i did on this previous version was search for the nearest value rather then calculating the average.

The data in Source A (The colored one) contains a table with 256 structures whose members contains the pointers to the gray/luminance values on it. The gray is used as an index.

So, as i posted previously, on the Table from the colored one we have a pointer to all luminance existant on it but indexed according to the gray value. (That is nothing more than the luma range itself).

So, we are trying to colorized a pixel on the gray image whose gray is 100, but on the source image (the colored) we don´t have any pixel that will represent that same "gray", i looked for the one nearest to it. So i start looking on the table (containing the structures) if it has a "gray" 101, 102, 103 etc and do the same backwards looking for 99, 98, 97. Then i only calculate the nearest value.

Say on the colored image we have this sequence of 'gray" (On our array of 256 structure): 96 97 98 103 105 108. Since we are looking for pixel 100 (on the gray image), i get the previous and next "gray" that actually is present on the colored image. On this case it is 98 and 103. Then i simply choose the one that is closer to 100. In case i pick up 98. I do that because 98 is closer to 100 then 103 is closer to 100. I use their deltas, rather then computing the average between them. So, Delta1 = 100-98 = 2   Delta2 = 103-100 = 3. The smaller delta (2) is the pixel nearest to the value of 100. So, i use 98 to fill the gap.

Of course, we could fill the gap simply calculating the average among them (103+98)/2 = 100.5. But remember that we are using luminance/gray as an index on a table ? They are in integer values. So we could never find the proper value when it comes to a fraction. That´s why i´m choosing the nearest one on this preliminary tests.

You can´t stretch the chroma randomly. If you do that, you will be out of the range of hue too. The best always looking for the luminance and using the table (described below) to retrieve the chroma/hue back. If the luma don´t exists, you can do as i said previously on finding the nearest, or even estimating the average or doing it by the neighbor etc. What you can´t do is rely on chroma or even on hue for retrieve that back. You always need to use the luminance as the guidance of the equations.

The best choice to extend the limitation of 256 colors is calculating the luminance of the neighbour pixels. Similar to when we are using a signature to identify images, for example. On this way, we can also retrieve texture as well. We only need to use a small matrix to do this. Let´s say, 3*3 of the gray pixels and 3x3 on the colored ones. And then we look 1st for their positions for a full match. Since the odds that one 3x3 matrix  is exactly the same as another is too low it is unlikely that an error is generated. You may think that we will need a huge database of 3x3 pixels to be used as samples, but, probably not. If we succeed to fix the CieLCH equations we will limit the total amount of colors to search and, by consequence, the total amount of luminance combinations as well.

From all those 16 million of colors, probably more then a half of it are perceptual similar to another one. I didn´t estimated yet the amount of unique colors, because i´m struggling trying to fix that damn CieLab/CieLCH equations, but the goal is forcing the algorithm (The RGBtoCieLCh function and it´s reversal) to be restricted on a table containing only

256 gray colors (each gray color correspond to a unique luminance value)
256 chroma values (same as above, each chroma value seems to be limited to the same range as the luma and the difference between the max and min are extremelly low)
360 (or most likely 338) different hue angles

The hue angle is being a particular problem.  The original equation simply does not fits to the results. There is a difference in the chroma/hue fraction of 1.125. I fixed it as proposed, but, the difference remains there. perhaps, i´ll need to multiply Y with 1.125 e divide X and Z by [2/(3-1.125)] to adjust that properly.

I didn´t tested yet. Whenever i fix one part, this difference appears on another. The problem is that X, Y and Z are attached to each other biasing on the matrix multiplication of Red, Green and Blue with those tristimulus values from sRGB for example.  If i simply multiply Y, i´ll end up with an error, because X and Z also uses the same Red, Green and Blue values just multiplied with their own tristmulus values.


I know i´m close to the solution, because the difference of 1.125 seems to be a fixed value, but i can´t actually see where to fix it. The new paper i wrote is a complete mess, btw. I do the equations and write it down on the doc, but when i see an error, i put it on the same paper to don´t forget later what i was doing. :icon_mrgreen: :icon_mrgreen: :icon_mrgreen:


The good thing is that the properties of Luma, Hue and Chroma to be used on the backward computation are all there now :) I simply need to adjust the equation for fit the relation between chroma and hue values.

What i did found about Chroma is that i can put them on the same table as in Luma range and ended up in something like this:

Code: [Select]
Gray/Color        Luminosity (Min)                         Chroma Min
      0                         0                                                 0
      1                         2.741066938704112e-1             26.0519460866180125
      2                         5.48213387740825841e-1             26.49074207984956658


So, when a pixel have the value a luminosity in the range of 2.74e-1 to 5.48e-1, it means , in fact, that this pixel will necessarily generate a gray color whose value is "1" (That we can use as an index)
It also means that whenever a pixel falls on the range of gray "1" (No matter what was the original combination that created that gray=1)  it will always have a chromacity range from 26.05194 to a max of 26.49074207....

The problem of the "normal" way is that when they do the conversion they don´pt check those kind of limits and accept extrapolations on the backward computation (CieLCh to RGB). So, it will inevitably clip the final values of R, G, B.

Say that we are trying to decrease/increase the chromacity of an image. On the "normal" way we can use whatever luminance we want with whatever chroma value, but, doing that we will clip the result generating a color that simply was not supposed to be there.
So, using a table, it is easier to make the values be restricted on their own boundaries.
If we are looking for luma = 5.6e-1 (or looking for gray = 2. It really don´t matter because they are the same thing. Gray=Luma range), all we need to do is search for it on the table where the luminance have a value of 2 (for example, that corresponds to that range of 5.48...e-1 to xxx) and take the chroma from there.

And the hue you may be able to use the one you inputted. What it seems is that the difference of chromacity is not enough to make a color be distinguishable from another. What it do matters is the hue angle.

Now...all i need to know is that if there is a similar limit for hue as well. (it seems there do exists, but i simply can´t find it using their own "normal' equations.)

The final result i hope we can get is something like this:

Code: [Select]
Gray/Color Luminosity (Min)             Chroma Min                            Hue Min
            0                         0                                     0
            1                         2.741066938704112e-1 26.0519460866180125               0
            2                         5.48213387740825841e-1 26.49074207984956658              35.1215º

If there is also the same limits for hue (Which seems to exists)  then the backward computation will be extremely easy as a pointer to tables.

What i´m finding in all of this is that the concepts of CIE are incorrect. Luminosity is simply the frequency of a luminous waveform, and what they call "chroma' is, in fact, the intensity of the luminosity over a pixel. And hue is the actual "chroma". They claims that Luminosity is completely isolated from Chroma and Hue, but that´s pure rubbish. Luma and "Chroma' seems to be faces of the same thing, but one is related to the wavelenght and other the intensity of it. Hue is the actual chroma that make our eyes identify different colors.
Title: Re: Fast Compare Real8 with SSE
Post by: guga on February 18, 2019, 04:06:11 PM
About

Quote
In your case, where you want to combine the luma of bitmap A with the chromas of bitmap B to construct a new bitmap from bitmap A with new colors we only need to create a 256 colors palette.

On this stage, yes.  Using 256 different colors is the 1st stage of the colorization method. It is the easier to see if the algorithms works as expected before develop it on a more extensive way. But if we want to retrieve more colors  (65536 or 16 million), other techniques will need to be used, such the searching for neighbours pixels.

So far the results i´m having with a 256 colors palette are ok, way better then i expected, in fact (although still limited to 256). Once we succeed to make it works correctly with those 256 colors, then making it works on 655356 or 16 million of collors would be way easier, since the basics technique will be already developed and the colorspace conversion functions properly fixed.

Oh...another thing...i´ll put here as a note so i won´t forget later :icon_mrgreen: :icon_mrgreen: Since there is a table of fixed values for "Chroma" (That, in fact seems to be only the intensity of luma), then we can only uses 2 values for the CieLCh colorspace. We wil no longer need to use "Chroma' to retrieve any color since all we need to do to retrieve Chroma is pointing to the luma table that will also contain the fixed minimum values of chroma already precalculated (i only need to check how big are the differences of this range of chroma to see if the minimum values can also be used on a fixed way). All we may need is Luma and Hue. I´ll make some more tests before trying to fix the equation to see if this can also be a possibility to use. If this is correct, then we may have created a new perceptual colorspace with less variables to use :bgrin: :bgrin: :bgrin:

Title: Re: Fast Compare Real8 with SSE
Post by: Siekmanski on February 18, 2019, 05:53:25 PM
Can't wait to see the results of your approach.  :t
I still have a lot to digest.  :bgrin:
Title: Re: Fast Compare Real8 with SSE
Post by: guga on February 21, 2019, 02:22:41 AM
Ok, guys, i finished the properties of the equations, but i´m having some problems to solve for Red, Green and Blue parameters.

I´m doing the equations backwards (i.e.: reversing it to make the proper fixes) since Red, Green and Blue are directly related to X, Y, Z.

What is the better way for posting the equations for you ? PDF, doc, simple text ?

I found some interesting properties for "a" and "b' and also from x, y, z, but as i said, using their own equation won´t work at all. I found the limits for all of those monster, but i´m having trouble to go deeper reverting the algorithm until it´s original values for Red, Green and Blue.

The goal is to make the function fixes the inputed values of Red, Green and Blue before going deeper on the computations of XYZ->CieLab->CIELCH. Since i found their properties, the backwards computation seems ok too. But, fixing the inputed values is needed to be done 1st on the RGB to CieLCH

Can someone help ?.
Title: Re: Fast Compare Real8 with SSE
Post by: daydreamer on February 21, 2019, 06:09:48 AM

Can someone help ?.
please post pdf ,I dont know if I can help,its complex formulas
would it help if you check 8x8 pixels simultanously?


Title: Re: Fast Compare Real8 with SSE
Post by: Siekmanski on February 21, 2019, 06:21:25 AM

What is the better way for posting the equations for you ? PDF, doc, simple text ?

Whatever you like.

Quote
I found some interesting properties for "a" and "b' and also from x, y, z, but as i said, using their own equation won´t work at all. I found the limits for all of those monster, but i´m having trouble to go deeper reverting the algorithm until it´s original values for Red, Green and Blue.

The goal is to make the function fixes the inputed values of Red, Green and Blue before going deeper on the computations of XYZ->CieLab->CIELCH. Since i found their properties, the backwards computation seems ok too. But, fixing the inputed values is needed to be done 1st on the RGB to CieLCH

Can someone help ?.

This is not an easy task, if you ask me. ( very complicated algorithm.... )
Title: Re: Fast Compare Real8 with SSE
Post by: daydreamer on February 22, 2019, 05:30:17 AM
Would it be possible with fixed point/integer code guga?
Using SSE2 integer instructions?
Hue conversion with help of byteshuffles
Title: Re: Fast Compare Real8 with SSE
Post by: guga on February 22, 2019, 06:51:32 AM
Hi Guys, i´m finishing the pdf for you  :t

Daydreamer.
For the luminance values, it is possible to use integer since i succeeded to make Luminance be used as a look up table relating it to grays integers. For chromacity, i´m not so sure yet, since i found out that the range of chromacity  for each luma is bigger then i thought (Although restricted to each luma as well. For example, when luma is on the same range as gray = 190, chromacity ranges are from 67.23 to 157). I started yesterday creating a fraction to be used for chromacity to it stay on the range of 0 to 100 (or only from 0 to 255, too - that could be used as an integer/index) but didn´t tested it yet, since the equations that makes the relation between chromacity and hue are still problematic.
About Hue, i´m quite sure it cannot be done with integers, since the range for hue is bigger and the formula have issues on it when trying to force it to be inside it´s own limits.

Using SSE can ensure the speed for the multiplication of the values in the tristimulus matrix and also and on the look up table and the trigonometry functions that are needed to be use but, 1st it is important to fix the equations.

The problem of CieLab/CieLCH is that they are trying to emulate a spherical color space using values that simply don´t fix to spherical coordinates (i.e: coordinates, x, y, z - Not the colorspace XYZ, but coordinates itself, as plotting on a 3d graphic). So, what they call "hue" is not exactly an angle between 2 of the axis (Even though they uses red and blue to reproduce this "pseudo axis"). I tried to represent the functions in a 3d space using R, G, B as the axis coordinates, but didn´t succeeded to reproduce where the values of Hue, "a", "b' (and neither x, y, z) fits to it. It´s like you trying to put an elephant inside a ping pong ball :icon_mrgreen: :icon_mrgreen: :icon_mrgreen: :greensml: :greensml:

I´m posting the pdf today as soon i finished cleaning the equations for you.

To you guys have an idea of what i´m doing, here is a screenshot containing the luminance table and some values for Chromacity i inserted on it to fits to the ranges. The "Ws_Matrix" is a huge structure that holds all necessary information to be loaded previously the RGB to CieLCH and CieLCH to RGB starts. Once all necessary data is fully preloaded, the app runs way faster. (Can do better with SSe, for sure  :t)

Title: Re: Fast Compare Real8 with SSE
Post by: Siekmanski on February 22, 2019, 05:12:46 PM
Quote
The problem of CieLab/CieLCH is that they are trying to emulate a spherical color space using values that simply don´t fix to spherical coordinates (i.e: coordinates, x, y, z - Not the colorspace XYZ, but coordinates itself, as plotting on a 3d graphic). So, what they call "hue" is not exactly an angle between 2 of the axis (Even though they uses red and blue to reproduce this "pseudo axis"). I tried to represent the functions in a 3d space using R, G, B as the axis coordinates, but didn´t succeeded to reproduce where the values of Hue, "a", "b' (and neither x, y, z) fits to it. It´s like you trying to put an elephant inside a ping pong ball :icon_mrgreen: :icon_mrgreen: :icon_mrgreen: :greensml: :greensml:

I think they don't fit because, RGB is spaced at 120 degrees steps and CieLCH is spaced in 90 degrees steps for the color range.
Title: Re: Fast Compare Real8 with SSE
Post by: guga on February 22, 2019, 07:19:36 PM
Hi Siekmanski

Quote
I think they don't fit because, RGB is spaced at 120 degrees steps and CieLCH is spaced in 90 degrees steps for the color range.

Probably, but i figure it out that, in fact, CieLCH have a fixed range of 270º (and, of course 4 gaps - from the 'normal' equations. 2 for gray and 2 for non existent color. Those gaps are separated by 90º each. After the fix, one of them that represented non existent colors, is not used any longer. 8)  ). I managed to fix those damn equations and inserted a multiplicand fraction in the final X (prior to the conversion to CieLab -> CieLCH)

I found practically all limits now. The only thing i need to make sure is the minimum limit for Hue (It seems to starts at 0º now, but..i´ll have to make sure, because it started at 68º previously, immediatelly after the gap for non existent colors - now fixed). The maximum range is all there. It ends at approximate  Hue of 338º.

Also chromacity and Luma i succeeded to find a ratio for it. I named it as CLFactor whose maximum value is sqrt(29). I´m only needing now to check for the minimum value (most likely is 5, but i didn´t confirmed yet. Too tired  :dazzled: :dazzled:)

I´m posting here the complete algorithm with the necessary fixes and all properties, limits and how i achieved those results. It is only missing the minimum values described in the end of the paper (last chapter/section).

For what i saw so far, my preliminary thoughts was correct, Chroma is only the intensity of luma and they can be used together to retrieve a value of Hue.

Since i´m putting them on tables, it would be way easier to the backwards computation using all those limits. Only 256 will be necessary for the limits of Chroma and also Hue can be limited to each Luminance range as well.

Here is the pdf to you check. Sorry for the mess. I did my best to try to make this paper more readable. :greensml:

If you can, please, check the math involved to see if the logic and results are really those i found :t
Title: Re: Fast Compare Real8 with SSE
Post by: guga on February 23, 2019, 02:25:14 AM
Hi Guys, Gooooooood morning, Vietnam  :greensml: :greensml: :greensml: :greensml:

A small fix.

I was testing the results using the new XFinal, but found out errors on the ouput that still are extrapolating the hue angle and CLFactor (it is generating values bigger then sqrt(29)). While reviewing the equation, i probably mistaken when multiplied x by *(1-2/sqrt(29))

Since i was taking onto account the value YFinal to the maximum, i probably needed to adjust X to fits to Y and not making it also isolated.

The fix i did was simply making XFinal = YFinal*(1-2/sqrt(29)). Doing this, the resultant values no longer generated errors, except a few 111 (among 16 million colors) that were caused due to rounding error.

So, if i did it correct this time, then, the values of XFinal, YFinal and ZFinal are given only by:

Code: [Select]
If  Y > (216/24389)
    YFinal = Y^(1/3)
Else
     YFinal = Y*(841/108) + (16/116)
Endif

XFinal = YFinal*(1-2/sqrt(29))

If  Z > (216/24389)
    ZFinal = Z^(1/3)
Else
     ZFinal = Z*(841/108) + (16/116)
Endif


Doing this, the values of X, Y, and Z, produces the correct values for hue and chroma, limited to the CLFactor max of sqrt(29).

Can someone please check if this updated fix is correct ?

It´s unlikely, but if it´s correct, then probably i succeeded to restrict Hue angle on a true 3d spherical coordinates making the maximum range of the angles be restricted to something around 90º as i was trying to do last year with a new ColorSpace i was developing also biased on one quadrant of a 3d sphere
Title: Re: Fast Compare Real8 with SSE
Post by: guga on February 23, 2019, 11:00:12 AM
Hmm..forget the above post. It works on the RGb to CieLCh but don´t works on the reverse operation. We need to try fixing the original equations in the pdf. I´ll continue tonight.
Title: Re: Fast Compare Real8 with SSE
Post by: guga on February 24, 2019, 01:23:33 AM
Did you guys understood what i tried to explain in the pdf ?

Is it possible to try fixing the equations on a way similar to this i posted in page 19 ?

Code: [Select]
If  X > (216/24389)
    XTmp = X(1/3)
Else
     XTmp = X*(841/108) + (16/116)
Endif
XFinal = XTmp*(1-2/sqrt(29))

If  Y > (216/24389)
    YFinal = Y(1/3)
Else
     YFinal = Y*(841/108) + (16/116)
Endif

If  Z > (216/24389)
    ZFinal = Z(1/3)
Else
     ZFinal = Z*(841/108) + (16/116)
Endif


It seems that XFinal can be retrieved with XFinal = XTmp*(1-2/sqrt(29)) (and perhaps ZFinal too), in order to the equation fits to the limit of sqrt(29), but i simply cannot find the proper way to fix those before the conversion to "a", 'b" (That generates chroma and hue) occurs.
Title: Re: Fast Compare Real8 with SSE
Post by: daydreamer on February 24, 2019, 03:59:02 AM
Did you guys understood what i tried to explain in the pdf ?

Is it possible to try fixing the equations on a way similar to this i posted in page 19 ?

Code: [Select]
If  X > (216/24389)
    XTmp = X(1/3)
Else
     XTmp = X*(841/108) + (16/116)
Endif
XFinal = XTmp*(1-2/sqrt(29))

If  Y > (216/24389)
    YFinal = Y(1/3)
Else
     YFinal = Y*(841/108) + (16/116)
Endif

If  Z > (216/24389)
    ZFinal = Z(1/3)
Else
     ZFinal = Z*(841/108) + (16/116)
Endif


It seems that XFinal can be retrieved with XFinal = XTmp*(1-2/sqrt(29)) (and perhaps ZFinal too), in order to the equation fits to the limit of sqrt(29), but i simply cannot find the proper way to fix those before the conversion to "a", 'b" (That generates chroma and hue) occurs.
I dont know,couldnt you let computer run try different constants until you get the right results so forward conversion pixels until it matches backward conversion with a small tile?,or output graphic curve makes understand visual?

Title: Re: Fast Compare Real8 with SSE
Post by: guga on February 24, 2019, 07:00:49 AM
Quote from: daydreamer
I dont know,couldnt you let computer run try different constants until you get the right results so forward conversion pixels until it matches backward conversion with a small tile?,or output graphic curve makes understand visual?

Hi Daydreamer.

I was doing it right now. I´m building a routine to find the maximum and minimum value of CLFactor to see if i can find how much it is bigger then sqrt(29).

Most likely, the bigger CLFActor is found on the smaller delta values (differences) between X, Y, Z. If i can find it using a few routines (I mean, calculating it from a maximum of 9 checks for different combinations using maximum R, G, B ranges using only 0 and 255 ) then probably i´ll see how much it is extending the limit in order to establish a fixing ration on XFinal, YFinal and ZFinal.

I´ll see if i can find the maximum values from the equation i created at page 21

(https://i.imgur.com/JG2A5xw.png)

Which is simply, the same as:

(https://i.imgur.com/2tQGQyQ.png)

I´m not sure about the results, because CLFactor is directly related to Hue and the last time i tried it this past week, i had to make the routines runs on all the 16 million colors to retrieve the maximum Hue. Also, probably i´ll have to use the cubic form (power of 3) on each X, Y, Z and CLFactor, due to the threshold limits that uses the power of 1/3 to retrieve those values, but, don't know yet

Not sure what will result, but i´ll give a try anyway.

I tried to use the CieLab on the 3d coordinates but was unable to understand how "a" and 'b' are related to the X, Y, Z axis. CieLab/CieLCH are a cylindrical colorspace trying to emulate a spherical one.  But uses those damn (x-y)*500 (y-z)*200 and thresholds to compute afactor, bFactor and thus, Hue and Chroma.

I have no idea how to represent the "x-y" and "y-z" on the 3d axis in order to try to fix it using basic geometry. I thought in displaying them on a sort of a cube containing inside the cylinder represented by CieLab, but don´t know how to represent this "x-y" stuff. I thought in sides of a cube forming the sides of the cylinder, but i´m clueless how to properly represent this.
Title: Re: Fast Compare Real8 with SSE
Post by: Siekmanski on February 24, 2019, 09:40:52 PM
We could get rid of the atan2, sin and cos trigonometric functions and calculate A and B to create the 3D space coordinates by creating a linear HUE table in the color conversion we want.

RGB color space:
luma = lightness, and brightness.
hue = color components = red(0 degrees) to green(120 degrees) to blue(240 degrees) to red(360 degrees)

(https://upload.wikimedia.org/wikipedia/commons/thumb/a/ad/HueScale.svg/360px-HueScale.svg.png)

As you can see, there is only a transition between 2 of the 3 RGB color components each 120 degrees.
We can use this as our A and B components.
The first 60 degrees is where A is RED 255,0,0 to YELLOW 255,255,0
The next 60 degrees is YELLOW 255,255,0 to GREEN 0,255,0
This completes the transition in 120 degrees between A and B in 512 steps.
A full 360 degrees = 3 * 512 = 1536 HUE color combinations.
 
These are all the integer hue color combinations ( without the luma component).

1536 * 256 (luma values) = 393216 RGB color combinations, which is 2,34375 % of all possible 65536 chroma values.
Thus we need a factor to calculate all 65536 different chromas.

65536/1536 = 42.66667 ( forward )
1536/65536 = 0.0234375 ( backward )

CIE color space:
red (0 degrees ), yellow(90 degrees), green(180 degrees), blue(270 degrees) and red(360 degrees)
A full 360 degrees = 4 * 512 = 2048 HUE color combinations.

(https://sensing.konicaminolta.us/uploads/ciecolorspace-29309si549.jpg)

65536/2048 = 32.0 ( forward )
2048/65536 = 0.03125 ( backward )

Now we can get A and B from a table of 2048 values using the factor or create a table with 65536 precalculated values.
I think 2048 values which only need 1 mul will be faster than a 65536 values without the mul. ( smaller data cache )

3D space:

X = A
Y = B
Z = Luma = ( R*0.1762044 + G*0.8129847 + B*0.0108109 ) ; is the Y value of the CIERGB2XYZ tristimulus matrix.

"And we don't need to mess with negative values because every value is automatically translated to the first quadrant ( X+, Y+ )"
"And we also don't need to use compare instructions because the table is a power of 2, we can use the AND instruction"

This is as far as I understand the CIE 3D colorspace, beware, I could be wrong though.......
Title: Re: Fast Compare Real8 with SSE
Post by: daydreamer on February 25, 2019, 12:17:30 AM
Siekmanski
It's simple linear interpolate hue between two colors 0-60  and shuffle them according to which sector they are in
One color channel increases with angle, while the other decreases,maybe can be done in similar ways with CieLCh but with 90 degrees,if that doesn't work with linear interpolation,try with cosine or other curve
Cubic colorspace remind me of super fast raycasting
Title: Re: Fast Compare Real8 with SSE
Post by: Siekmanski on February 25, 2019, 12:44:32 AM
Hi Magnus,

Yes, that is exactly what I was thinking, hence the linear LUT method.  :t
We will see, I have to study CieLch a bit more to know what I'm doing.
If I read guga's paper, a lot of corrections are needed and makes the algorithm more complex then it already is.....

But it has my interest, wish I had more time left to spend on it......  :biggrin:
Title: Re: Fast Compare Real8 with SSE
Post by: Siekmanski on February 25, 2019, 01:32:45 AM
Hi guga,

Is this the correct route from RGB to CIElch?
rgb2xyz -> xyz2lab -> lab2lch and vice versa?
Title: Re: Fast Compare Real8 with SSE
Post by: guga on February 25, 2019, 02:28:41 AM
Quote
Is this the correct route from RGB to CIElch?
rgb2xyz -> xyz2lab -> lab2lch and vice versa?

Hi Siekmanski

yes, this is the correct route.

Just keep in mind that when converting from xyz2Lab, a correction must be done only in the x Axis (XFinal). I´m close to find a constant to be used in the xAxis before it is transformed to originate Lab values. The hue angle have a maximum around 338º. Those graphics on wiki are not precise because CieLCH forms a parabolic cylinder, but the "a" "b" used in CieLab to compute hue and chroma are not exactly you may call "axis". Hue does not occupy all 360º. That´s why the original CieLab function produces incorrect results but, i´ll probably close to find a constant for it. (In fact, i found one, but not sure if it will work on all color models yet. (sRGB, HDTV, Adobe tristimulus matrices).

Hue can be calculated from
hue = atan(5/2) + asin(-ClFActor*sqrt(29)/29) and using the image i posted previously we can get rid of the trigonometric computations using an approximation of the asin,acos, or atan functions, such as the chebyshev functions you described. I found three functions that uses an approximation in nvidia, but i was unable to port it yet. On this, way, calculating a hue with a constant (atan(5/2) plus a polynomial fucntion will be probably faster then using atan from Fpu

You can calculate CLFactor directly from x,y,z or a more direct way from chroma as:
CLFactor = 1000*y/Chroma . Note: "y" from xyz. If you want to use Luma, just replace y = (Luma+16)/116
Thus, hue = atan(5/2)+asin(-(1000*y/chroma)*sqrt(29)/29)

Can you give a try porting/optimizing these (With doubles- Real8, rather then float)?
Specially, this one (asin) 8)
Code: [Select]
float asin(float x) {
  float negate = float(x < 0);
  x = abs(x);
  float ret = -0.0187293;
  ret *= x;
  ret += 0.0742610;
  ret *= x;
  ret -= 0.2121144;
  ret *= x;
  ret += 1.5707288;
  ret = 3.14159265358979*0.5 - sqrt(1.0 - x)*ret;
  return ret - 2 * negate * ret;
}

Code: [Select]
float acos(float x) {
float negate = float(x < 0);
x = abs(x);
float ret = -0.0187293;
ret = ret * x;
ret = ret + 0.0742610;
ret = ret * x;
ret = ret - 0.2121144;
ret = ret * x;
ret = ret + 1.5707288;
ret = ret * sqrt(1.0 - x);
ret = ret - 2 * negate * ret;
return negate * 3.14159265358979 + ret;
}

Code: [Select]
float atan(float y, float x)
{
float t0, t1, t2, t3, t4;

t3 = abs(x);
t1 = abs(y);
t0 = max(t3, t1);
t1 = min(t3, t1);
t3 = float(1) / t0;
t3 = t1 * t3;

t4 = t3 * t3;
t0 = -float(0.013480470);
t0 = t0 * t4 + float(0.057477314);
t0 = t0 * t4 - float(0.121239071);
t0 = t0 * t4 + float(0.195635925);
t0 = t0 * t4 - float(0.332994597);
t0 = t0 * t4 + float(0.999995630);
t3 = t0 * t3;

t3 = (abs(y) > abs(x)) ? float(1.570796327) - t3 : t3;
t3 = (x < 0) ? float(3.141592654) - t3 : t3;
t3 = (y < 0) ? -t3 : t3;

return t3;
}

I found them on nvidia at http://developer.download.nvidia.com/cg/asin.html that used the "Handbook of Mathematical Functions - M. Abramowitz and I.A. Stegun, Ed.". I downloaded the free version of the ebook, but couldn't understand how those values/constants were achieved, but could be worthful give a try and see if we can have those constants with a bit of more precision.
Title: Re: Fast Compare Real8 with SSE
Post by: Siekmanski on February 25, 2019, 03:12:01 AM
Aha, COOL!  8)

Do you really need real8 precision?
Afterall the end results are natural numbers ( integer values ).
I'll try to write the trig functions and put something together in my sparse spare time. ( this can take some time. )
Or you can try the LUT approach in my previous post.
Title: Re: Fast Compare Real8 with SSE
Post by: guga on February 25, 2019, 04:13:34 AM
Aha, COOL!  8)

Do you really need real8 precision?
Afterall the end results are natural numbers ( integer values ).

 :t :t Unfortunately, we will need Real8 since all data (Including those on the LUT) are written in Real8 (Matrices and constants) and we will need precision to avoid that during the backwards computation we won´t end up on another value for RGB. For example: if the input is R 200, G 251, B 108, it will be transformed to Luma, Chroma and Hue. But if the user do the conversion again, if we don´t grant a bit more precision, it will most likelyt the results be R 198, G 250, B 107, and if we do the conversion again...other numbers can be generated on a cascading error.

Quote
I'll try to write the trig functions and put something together in my sparse spare time. ( this can take some time. )
Or you can try the LUT approach in my previous post.

Tks  :t :t :t :t

Yeah, i´ll try the LUT approach oncei suceed to fix the XAxis problem. If i suceed i can try to use integer for Luminosity and Chroma, forcing them to stay on limits as, let´s say: 0 to 255 (as the same for RGB integer) and for the hue ranges i´ll need to see how big is the difference on each luma before giving a try to make them work on integers too.

For example: Luma can be done as simple integers whose real values comes from the LUT. Chroma also maybe used as such (If the range is not too big) for each Luma range. So, If Luma = 100 (Index), Chroma can have a range between 58.584 to 137.1245 and thus, it would be enough to make DeltChroma/255 and use this fraction to be multiplied by the integers to retrieve the correct chroma values back from the LUT. Ex:
x*(137.1245-58.584)/255 + 58.584
On this way we can use an integer (x, for example from 0 to 255) to it stay withing the range of 137.1245-58.584

About doing the same for Hue, i´m not sure yet, since i´ll need to check if the  ranges for it are too much bigger then the ones from Chroma and how Hue is affected by Chroma as well. But, it is possible to use Hue as a range of integers as you proposed from 0 to 2048 (for example. or even from 0 to 255 too, is possible to be made) and without needing to build an extra table for that. It would be possible only calculate a ratio for Hue too after seeing precisely how it is related to chroma

At the end, if all of this goes ok, everything will be calculated taking the integers for Luminance in account and simply adjusting chroma (or hue, both in integers too, i hope) to retrieve the proper values from the Table of structures containing only 256 elements.
Title: Re: Fast Compare Real8 with SSE
Post by: Siekmanski on February 25, 2019, 04:50:06 AM
Quote
:t :t Unfortunately, we will need Real8 since all data (Including those on the LUT) are written in Real8 (Matrices and constants) and we will need precision to avoid that during the backwards computation we won´t end up on another value for RGB. For example: if the input is R 200, G 251, B 108, it will be transformed to Luma, Chroma and Hue. But if the user do the conversion again, if we don´t grant a bit more precision, it will most likelyt the results be R 198, G 250, B 107, and if we do the conversion again...other numbers can be generated on a cascading error.

We store the real4 end results for the backwards conversion, it will have enough precision.
We only have to  convert them from real4 to integer ARGB format if we present them to the screen as a Bitmap Image.
Title: Re: Fast Compare Real8 with SSE
Post by: guga on February 25, 2019, 05:30:39 AM
Hmm...Ok..If you think that we can not be in risk of having an cascade effect, we can try with real4 on the outputed results of Hue.


Btw...I`m close enough to a fix  :dazzled:.

I found this constant to be multiplied with x before it is transformed to "a' "b"

1- [(1/(8.8564516e-3)^1/3)^2/29]*0.628609323645896273706847552307563478876198609541738700849

The true value of X in order to stay in the limits of sqrt(29) (To produce a valid hue) should have divide by (almost) it´s half. (Which make sense, if we consider CieLCh as a cylinder whose x axis are, in fact, it´s half (The radius). Probably the error was that "x" in original CieLCh equation is the diameter and not the radius of the cylinder ? )

0.4936202640503811484389906431356658372727441795149125

This number generated only 39208 errors from all of the 16 million colors. . Those few errors were extrapolations of sqrt(29) for hue on an error margin of something around 1e-3.  I`m too damn close to a constant value :greensml: :greensml: :greensml:. Pretty sure that this is caused by the limit of the maximum Y be  bit smaller then 1. I´ll do some more checks before trying to use this constant in the other color models to tests if it is really a valid one.
Title: Re: Fast Compare Real8 with SSE
Post by: guga on February 25, 2019, 06:45:06 AM
Ok, for HDTV, C and D65 Models (2º degree observer) the amount of errors were small. Less then 40000 among 16 million colors. (All errors with a tiny margin of error of something around 1e-3)

For D50 models (prophoto, Beta, colormatch) , it have thounsand of errors, but all of those, on a small margin of 1e-3, except 1 error of something around 1.28
For CiE_RGB_E it also have huge errors and all of them on smaller margins (1e-3), except one that resulted on an error of 1.26

Amazingly, NTSC RGB "C" did not produced any error.


I´m building a small fix using that constant and an error check for the maximum Y to see if it will decrease even more the errors. I know the error margin is really small, but it is there yet.

From all those errors, i´m thinking that, in fact, CIE made an incorrect analysis of CieLab/CieLCH colorSpace. The x from XYZ when transformed to lab/Hue/Chroma represents the radius of the cylynder (this colorspace is a parabolic cylinder) and not the diameter as, apparently was being considered on the original equations. One thing is for sure now. When converting to "a" "b", x must be divided by it´s half (almost the half, in fact).

One good thing is that, now, Chroma is restricted to something around 100. I mean, it´s delta have a difference of something in between 100 to 118 on each luma range. So, it is likely that we can use integers as well :)

After fixing this with Y, i´ll made the tests to see if the backwards conversion also works on this way. (let´s hope it works too :icon_mrgreen:)
Title: Re: Fast Compare Real8 with SSE
Post by: guga on February 25, 2019, 12:13:13 PM
1st tests on the reverse function (CieLCH to RGB).

The structure/texture of the image is preserved, also the general light. Needed only minor adjusts that are still extrapolating the limit of sqrt(29) on the hue.

The "greenish" is normal, because i didn´t adjusted the reversed function to be limited on the LUT Table, thus, it is shifting the hue. This was only a test to see if the backward function will still recover the RGB back with and without adjusting the luma. Remember i told the claims of CIE that Luma was totally independent was incorrect ? Luma affects hue (and also chroma). This is why some limits needs to be established on the LUT table to avoid shifting.

(https://i.imgur.com/HDOoiqo.png)

I´m finishing one or 2 functions to force the X Axis ratio to obey the limit for hue and later i´ll fix the CieLCHtoRGB to stay inside the limits for Luma, Hue and Chroma accordying to the Ws_Matrix structure (The LUT table)
Title: Re: Fast Compare Real8 with SSE
Post by: guga on February 26, 2019, 12:42:26 AM
Hi Guys,

i´m starting a couple of fine tune adjustments on the equations and will start converting the outputs to integers.

It is better to we use integers as output for Luma, Chroma and Hue, right ?

From the results i´m seeing so far, i plan to make the range of integers be the same as in R, G, B. So, Luma, instead output a Real8, it will output an Integer value. The same can be done with Chroma. I´m not sure, however, what could be the best strategy for Hue.

For example, once i convert Luma to Integer, i´ll be forced to make a minor adjust to compensate potential loss of Chroma values. I chosen to make the adjustments in Chroma, since Hue (that is the actual thing that creates the color) needs to remain unchanged.

From the formula i developed, Hue is simply the sum of a constant (atan(5/2) with the asin of Luma/Chroma. So, in order to keep Hue intact, and since i´m using a Table of Lumas, all i have to do is create a multiplicand ratio to be used in Chroma. Ex: Chroma*k
where k = the compensation ratio

The value of k can be calculated as:

k = (NextLuma+16)/(PreviousLuma+16)

Where "NextLuma" is the next value of the Luma in the Ws_Matrix table and previousLuma, follows the same logic.

Ex: In the Ws_Matrix structure (Our LUT) i have:
Quote
Gray/LumaIndex            Luma Value
(...)
100                               58.69656
101                               59.01234
(...)

So, all is needed is calculate the compensation ratio, like:
(59.01234+16)/(58.69656+16)


And use the resultant value to multiply with Chroma on the RGBtoCieLab function (probably it will be needed the multiplication only on this fucntion and not on it´s reversal - CieLChtoRGB, afterall it will be presumed this correction fraction will be already done, since we are using integers on CieLCh to RGB too.

The compensation ratio in between each Luma is really small (approximately 1.05 to a max of 1.002), but it is important to avoid cascading errors and also keep the maximum amount of pixels (the 16 millions of colors) ready to be used on the other parts of the function and it´s reversal.

At 1ft, the compensation of Chroma from Luma is the easiest part and won´t affect the general performance of RGBtoCieLCH.

About Chroma using integers, it can be done (From 0 to 255, too). However, it won´t be able to do the same compensation as made in Luma, since Chroma is a fraction of Luma, and thus, it will simply revert luma to a Real8 if i also do that compensation for chroma.

Then i have 2 alternatives.

a) Creating a compensation ratio specific for Chroma, but will affect Hue angle (Won´t affect too much, btw, but i really don´t know the actual results yet. I mean, even if i increase hue on the RGBtoCieLCh, in theory this can be reverted back on the CieLCHtoRGB)
or
b) Accept eventual the loss of accuracy (In hue) and make chroma uses only 255 values without any compensation

Question #1 - What do you prefer ?

Question #2 - If i use Hue also as integer, it will be better i create a output parameter on RGBtoCieLCH where the Real8 value of Hue will be stored (To display, for example, on a editbox, staticbox etc), . Sure, Hue will be also used as an index from the LUT, but the user may wants to actually see what is the Hue he is working with. So, add an extra parameter or keep everything as integers and let the user calculate hue from the index by his own ?

Most Likely, it is possible to add some special flags on the CieLCHtoRGB function allowing the user to shift the hue all over it´s range, or limited to it´s own luma range, or also link all values together (luma, chroma, hue can be automatically corrected), or link only (Chroma/hue), or even recoloring an image using only the minimum or maximum hue for each luma range etc etc.  Not sure yet, what can be added as a flag, since i didn't started to work on this particular part of the backward function (That needs correction 1st to stay in within the limits of LUT).

Title: Re: Fast Compare Real8 with SSE
Post by: daydreamer on February 26, 2019, 05:17:10 AM
congrats on making a shehulk app  :greenclp:
integers vs floats,check raymonds page about fixed points a 3rd alternative
for example you can have integer part+fraction part for better precision than 0-255 integers,but also fast

Title: Re: Fast Compare Real8 with SSE
Post by: guga on February 26, 2019, 06:24:54 AM
congrats on making a shehulk app  :greenclp:
(http://i68.tinypic.com/jsixqq.jpg) I would marry her, but she´s not mature yet. :greensml: :greensml: :greensml:

Quote
integers vs floats,check raymonds page about fixed points a 3rd alternative
for example you can have integer part+fraction part for better precision than 0-255 integers,but also fast

Here, the problem won´t be speed, but strategy itself to make it easier to find the values on the look up table. All the Floating Point data are still there, hidden and precalculated. It won´t affect precision (At least, in what concerns luma whose ranges between the table are really really small.
Ex: When luma is on the range of gray = 168 (The index in the look up table), it´s value is 72.325013998827 and on the next range (169), it is 72.658443961269242.

We can force luma to stay on it´s minimum value always and compensates increasing the next Chroma on the same range a little bit (That is mainly a intensity of luma). I didn´t tested yet, but it is likely that it won´t affect the results changing luma. For chroma, i´m not so sure, since the range between the previous and next chroma is big (But, probably we can use integers to it always stay in between 0-255 too)

For example, for chroma we have on that same range as luma
Index = 168, Chroma Max = 264.299252280022415, Chroma Min = 153.273578770510227.

See ? Chroma range is always around 100 on each luma range. So we can simply creating a index and use it as a percentage. For, example, the function can be like this:

call CieLCHtoRGB, 150, 75, PointertoHue, OutRed, OutGreen, OutBlue

The 1st parameter is the integer value for luma (the index). So, when we input 150, it will look on the LUT for the correspondent Real8 value and start calculating the rest of the variables.
About chroma, when we do 75, he can also do the same, but will generate the true value of chroma from a percentage. (But instead 0 to 100, using 0 to 255 to force it to achieve a bit more accuracy)

Since the difference in Luma (using integers) i can compensate increasing on Chroma (That, by consequence, will adjust hue to it always be in it´s limits), i maybe can also adjust chroma on that same way.

Each Luma range has his unique values for Chroma and Hue, that´s why i thought in use at least luma and chroma as integer so we can make it a simple index rather than a float/real8 value. For example, when chroma = 130, it will be, necessarily to stay in another luma range and no longer in 168.

Also, when Luma Range is 168 we have 'only" 102966 pixels that falls on this range distributed among a Hue range from 152.972º to 197.494º. So it is something around 2000 pixels per hue only. If we divide chroma by 255 it will then be 9 pixels only per chroma/hue. Considering that those 9 pixels are very similar to each other, using integers, probably, won´t affect the general looks of the image.

The maximum it will happens (in theory) is increase a little bit the range of Hue on each luma range,but won´t affect the conversion of the pixels, since the equations are now fixed.

Didn´t tested yet, but...it seems valid, and  is a form to use the method proposed by Marinus as well, but using precalculated ratios, than creating more tables.
Title: Re: Fast Compare Real8 with SSE
Post by: Siekmanski on February 26, 2019, 06:56:41 AM
Quote
It is better to we use integers as output for Luma, Chroma and Hue, right ?

If I speak for myself, the most convenient and fastest way is convert the 32 bit pixels to 4 floats, do your math on them and when done, you convert them back to integer 32 bit pixels to present them to your screen.

This can be done very fast in SSE2, all 4 ARGB members in one go.
And without shifting them in place.  8)

Code: [Select]
.data
align 16
ARGB_pixels dd 0ff010203h,0ff040506h,0ff070809h,0ff0a0b0ch

.code

    pxor        xmm6,xmm6                   ; Empty the source operand, to zero the integer high parts,
                                            ; inside the "punpcklbw" and "punpcklwd" instructions
    movdqa      xmm7,oword ptr ARGB_pixels  ; Load 4 ARGB pixels at once

    movq        xmm0,xmm7                   ; Load first 32 bit ARGB pixel
    punpcklbw   xmm0,xmm6                   ; Convert 4 bytes to 4 words
    punpcklwd   xmm0,xmm6                   ; Convert 4 words to 4 dwords
    cvtdq2ps    xmm0,xmm0                   ; Convert 4 dwords to 4 real4 values

; Do all floating point calculations here
; when done, convert 4 real4 values at once to a 32 bit ARGB pixel

    cvtps2dq    xmm0,xmm0                   ; Convert 4 real4 values to 4 dwords
    packssdw    xmm0,xmm0                   ; Convert 4 dwords to 4 words
    packuswb    xmm0,xmm0                   ; Convert 4 words to 4 bytes
    movd        dword ptr [edi],xmm0        ; Store as 32bit ARGB format
Title: Re: Fast Compare Real8 with SSE
Post by: Siekmanski on February 26, 2019, 07:35:39 AM
What Magnus proposed is also very handy.
A fixedpoint ( 16:16 or whatever range or precision you need ) index pointer.
I have used this very often in sound programming for changing frequencies or for timing calculations by counting fractions of the sound samples.
Title: Re: Fast Compare Real8 with SSE
Post by: guga on February 26, 2019, 09:21:36 AM
Hi marinus,

I´m not sure i´m understanding this. I didn´t used fixed points yet to be used as an index.

About SSE, we can use it. However, each byte is used as an index, so, how to load them with SSE to point to the proper location ?

For example:
Pixel1 = 0, 175, 15, 20
Pixel2 = 0, 35, 11, 150
Pixel3 = 0, 98, 18, 14
Pixel4 = 0, 5, 15, 10

Even if i load all 4 pixels at once, i still would need to take each byte separated, convert it to a dword and point it to the proper location on the table, and then do the math on each value grabbed by the index. I realize this is faster, but never did like that before using SSE. Also, there´s a problem when the image/video is using pitch to point to the proper address (like in VirtualDub, for example), how to overcome this ? And what if the image width or height is not a multiple of 4 ?

If you and Magnus could provide a working example it would be better to visualize all of this an try to make it work that way. Examples of fixed point to be used like that, and the retrieval of the values on a table using SSE after being converted from byte, then it would be better for me to understand how this can be done.

I´m trying to optimize the functions using pointers to tables but i´m clueless how to make it faster using SSE or fixed point.

Also for the output, what we will see for chroma, for example, wouldn´t be better it always starts from 0 to 100 or 0 to 255, considering that Chroma is also a range linked to the range of luma ?

For example, the outputs (Limits) of all limits i've got so far are in the form of a table, each one of them on it´s own range.
Example:

Quote
Gray/Index            Luma                 Chroma Min     Chroma max          Hue Min        Hue Max
100                      65.54654             35.456               85.659              118.65º           175.656º
...
255                      100                     253.79               253.85              180º              180º

and so on

Those values are pre-calculated and if the user inputs, for example, Red = 100, it will then, look for the limits on this table above, simply pointing to offset "100" and taking the values from there in order to check the boundaries.

For RGBtoCieLCH, it will do the math to it calculate Chroma, Hue and Luma (after using red, green and blue also as pointers to other tables, like the Kfactor map that contains pre-calculated all the gamma to be multiplied to the tristimulus matrix)

For CieLCHtoRGB, it will then use the above table to check the limits. For example,  if you input Chroma = 150.11254  and Index(Luma/Gray) = 100, it will 1st check if the chroma you inputed is inside the Min and Max Chroma. If it´s inside, it will continue and do the math to convert it back to RGB. If not, it will try to fix the inputed values. The fix can be made, pointing Chroma to the maximum value, since there´s no chroma of 150.11 and the nearest value is 85.659.

That´s why i thought in using integers, because we can use the integer values 0 to 255 as a percentage. On this way, you can´t input incorrect values anyway, because if you use as input for Chroma = 128 (for example), it will calculate the chroma as half of the range from the min and max (85.659-35.456)/2 and thus it will always stays within the limits for each range on the table.




Title: Re: Fast Compare Real8 with SSE
Post by: Siekmanski on February 26, 2019, 11:26:01 AM
It all depends on your style of coding.
There are many many ways to get the same result.

The only advantage of SIMD is, you can do multiple calculations at once.
In the case of color conversions, SIMD is ideal.
You can calculate all 3 RGB values in parallel.
And with some horizontal and vertical shuffles you can do cool things too.

Using a real4 value as an index pointer:

- Convert the 4 real4 values to 4 dwords ( cvtps2dq xmm0,xmm0 )
- Shuffle the dword you need to position 0 ( shufps xmm0,xmm0,Shuffle(0,3,2,1) )
- Copy the dword to a register ( movd eax,xmm0 )
- Use eax as the index pointer for the LUT.
- Shuffle the next dword in position etc.

This way you don't need fixedpoint integers to get a value from a LUT.
And your LUT can be any size if you use a factor which gives you the correct position in the LUT.

SSE2 also has min max instructions, useful to clamp ranges, as in your algorithm for chroma and hue.
You can clamp  all 2 of them at once with 2 instructions.

Minimum_mask    real4 0.0,0.0,35.456,118.65
Maximum_mask   real4 0.0,0.0,85.659,175.656

 maxps   xmm0,Minimum_mask
 minps   xmm0,Maximum_mask

Title: Re: Fast Compare Real8 with SSE
Post by: guga on February 26, 2019, 12:34:43 PM
Amazing !

I really will have to start trying to port the code to SSE. I rarely rarely use those instructions and it is a bit hard for me to follow yet.

About shufps, what is the macro "Shuffle(0,3,2,1)" ?
Can i write it as:
Quote
[SHUFFLE | (#4 or (#3 shl 2) or (#2 shl 4) or (#1 shl 6) ) ] ;  used as a paramacro.

shufps xmm0 xmm0 {SHUFFLE 0,3,2,1}

Is this correct ? And if i want to use the position as  {SHUFFLE 3,2,0,1}  ? The value will always  be positioned at "0" and the other values will be exchanging their pos, right ?


I´ll start adjusting the code and fix the backwards functions. Once i finish i´ll give a try in porting all of it to SSE2. I´m just worry in trying to make it portable to others cases when we have a RGBQUAD usage rather then a ARGB etc... using this shufps may also solve this in order to keep portability.


SSE macros may also be very handy to use. Do you have any that can make comparisons ? As the one we created for FPU in Rosasm ?

Code: [Select]
Fpu_If R$MyVar = R$Zero
; do this:
Fpu_End_If

Which unrolled produces:

    FLD R$Zero
    FLD R$MyVar
    FCOMPP
    FSTSW AX
    FWAIT
    SAHF
    JNE L0>>

; Do this

L0:

Or...
Code: [Select]
Fpu_If R$MyVar > R$Zero
; do this:
Fpu_End_If

Which unrolled produces:

    FLD R$Zero
    FLD R$MyVar
    FCOMPP
    FSTSW AX
    FWAIT
    SAHF
    jna L0>>

; Do this

L0:

How to make similar ones (for comparisons) using SSE 2 ?
Title: Re: Fast Compare Real8 with SSE
Post by: jj2007 on February 26, 2019, 03:42:36 PM
Do you have any that can make comparisons ?

include \masm32\MasmBasic\MasmBasic.inc         ; download (http://masm32.com/board/index.php?topic=94.0)
  Init
  SetFloat xmm0=1234567890.12345
  SetFloat xmm1=1234567890.12346
  PrintLine "default precision:"
  Fcmp (http://www.webalice.it/jj2006/MasmBasicQuickReference.htm#Mb1201) xmm0, xmm1
 .if Zero?
        PrintLine Str$(f:xmm0), "==", Str$(f:xmm1)
 .elseif FcmpLess
        PrintLine Str$(f:xmm0), Chr$(60), Str$(f:xmm1)
 .else
        PrintLine Str$(f:xmm0), Chr$(62), Str$(f:xmm1)
 .endif 
  PrintLine CrLf$, "top precision:"
  Fcmp xmm0, xmm1, top
 .if Zero?
        PrintLine Str$(f:xmm0), "==", Str$(f:xmm1)
 .elseif FcmpLess
        PrintLine Str$(f:xmm0), Chr$(60), Str$(f:xmm1)
 .else
        PrintLine Str$(f:xmm0), Chr$(62), Str$(f:xmm1)
 .endif 
  Inkey CrLf$, "comparing floats is fun!!!!"
EndOfCode


Code: [Select]
default precision:
1234567890.12345==1234567890.12346

top precision:
1234567890.12345<1234567890.12346

comparing floats is fun!

Attached a project comparing directly xmm0 and ST(0).

Re shuffling, see SwapBytes (http://masm32.com/board/index.php?topic=6483.msg82542#msg82542)
Title: Re: Fast Compare Real8 with SSE
Post by: Siekmanski on February 26, 2019, 10:06:09 PM
Amazing !

I really will have to start trying to port the code to SSE. I rarely rarely use those instructions and it is a bit hard for me to follow yet.

About shufps, what is the macro "Shuffle(0,3,2,1)" ?
Can i write it as:
Quote
[SHUFFLE | (#4 or (#3 shl 2) or (#2 shl 4) or (#1 shl 6) ) ] ;  used as a paramacro.

shufps xmm0 xmm0 {SHUFFLE 0,3,2,1}

Is this correct ? And if i want to use the position as  {SHUFFLE 3,2,0,1}  ? The value will always  be positioned at "0" and the other values will be exchanging their pos, right ?

No, I used the shufps instruction for rotating the 4 positions within xmm0 to extract each dword from pos 0 - 3

Code: [Select]
    movd    eax,xmm0                    ; get dword from position 0 = 0     [3 2 1 0]
    shufps  xmm0,xmm0,Shuffle(0,3,2,1)  ; positions   [0 3 2 1]
    movd    eax,xmm0                    ; get dword from position 0 = 1       
    shufps  xmm0,xmm0,Shuffle(0,3,2,1)  ; positions   [1 0 3 2]
    movd    eax,xmm0                    ; get dword from position 0 = 2
    shufps  xmm0,xmm0,Shuffle(0,3,2,1)  ; positions   [2 1 0 3]
    movd    eax,xmm0                    ; get dword from position 0 = 3

If you want for example, position 2 directly:

Code: [Select]
    shufps  xmm0,xmm0,Shuffle(3,0,1,2)  ; positions 2 and 0 are swapped
    movd    eax,xmm0                    ; get dword from position 0

Keep in mind that the positions stay in the shuffled state.
So you need to keep track of the changed positions within the xmm register after each shuffle.

If you don't like this, you can use pshufd and shuffle the value you need directly to another xmm register.

Code: [Select]
    movd    eax,xmm0
    pshufd  xmm1,xmm0,Shuffle(0,0,0,1)
    movd    eax,xmm1
    pshufd  xmm1,xmm0,Shuffle(0,0,0,2)
    movd    eax,xmm1
    pshufd  xmm1,xmm0,Shuffle(0,0,0,3)
    movd    eax,xmm1

The shuffle macro I use:

Shuffle MACRO V0,V1,V2,V3
    EXITM %((V0 shl 6) or (V1 shl 4) or (V2 shl 2) or (V3))
ENDM

common used compare instructions:

comiss comisd
ucomiss ucomisd
cmpps cmpss cmpps cmppd
pcmpeqb pcmpeqw pcmpeqd
pcmpgtb pcmpgtw pcmpgtd , and many more....

Code: [Select]
Equal CMPEQSS
Equal CMPEQPS
Less Than CMPLTSS
Less Than CMPLTPS
Less Than or Equal CMPLESS
Less Than or Equal CMPLEPS
Greater Than CMPLTSS
Greater Than CMPLTPS
Greater Than or Equal CMPLESS
Greater Than or Equal CMPLEPS
Not Equal CMPNEQSS
Not Equal CMPNEQPS
Not Less Than CMPNLTSS
Not Less Than CMPNLTPS
Not Less Than or Equal CMPNLESS
Not Less Than or Equal CMPNLEPS
Not Greater Than CMPNLTSS
Not Greater Than CMPNLTPS
Not Greater Than or Equal CMPNLESS
Not Greater Than or Equal CMPNLEPS
Ordered CMPORDSS
Ordered CMPORDPS
Unordered CMPUNORDSS
Unordered CMPUNORDPS
Equal COMISS
Less Than COMISS
Less Than or Equal COMISS
Greater Than COMISS
Greater Than or Equal COMISS
Not Equal COMISS
Equal UCOMISS
Less Than UCOMISS
Less Than or Equal UCOMISS
Greater Than UCOMISS
Greater Than or Equal UCOMISS
Not Equal UCOMISS


some need some extra explanation:

Code: [Select]
Pseudo-Op            CMPPS Implementation

CMPEQPS xmm1, xmm2    CMPPS xmm1, xmm2, 0
CMPLTPS xmm1, xmm2    CMPPS xmm1, xmm2, 1
CMPLEPS xmm1, xmm2    CMPPS xmm1, xmm2, 2
CMPUNORDPS xmm1, xmm2 CMPPS xmm1, xmm2, 3
CMPNEQPS xmm1, xmm2   CMPPS xmm1, xmm2, 4
CMPNLTPS xmm1, xmm2   CMPPS xmm1, xmm2, 5
CMPNLEPS xmm1, xmm2   CMPPS xmm1, xmm2, 6
CMPORDPS xmm1, xmm2   CMPPS xmm1, xmm2, 7

Just look them up in the Intel Software Developer's Manuals:

https://software.intel.com/en-us/articles/intel-sdm

The next document contains the full instruction set reference, A-Z, in one volume.
Describes the format of the instruction and provides reference pages for instructions.
This document allows for easy navigation of the instruction set reference through functional cross-volume table of contents, references, and index:

https://software.intel.com/sites/default/files/managed/a4/60/325383-sdm-vol-2abcd.pdf
Title: Re: Fast Compare Real8 with SSE
Post by: guga on February 27, 2019, 01:37:12 AM
Tks, Marinus and JJ :t :t :t

I´ll take a look on those and try to learn the SSE coding. It is really interesting.

JJ. Your fcmp macro is actually a function, right ? movups loads 4 Real4 at once, is that it ?


Marinus, i´ll start to read the manual this afternoon. In the meanwhile i found this from the other manual (https://fizyka.umk.pl/~daras/mtm/26_IA32-1-sse.pdf)

Quote
11.6.12. Branching on Arithmetic Operations

There are no condition codes in SSE or SSE2 states. A packed-data comparison instruction generates a mask which can then be transferred to an integer register. The following code sequence provides an example of how to perform a conditional branch, based on the result of an SSE2 arithmetic operation.

cmppd XMM0, XMM1; generates a mask in XMM0
movmskpd EAX, XMM0; moves a 2 bit mask to eax
test EAX, 0,2 ; compare with desired result
jne BRANCH TARGET

The COMISD and UCOMISD instructions update the EFLAGS as the result of a scalar comparison. A conditional branch can then be scheduled immediately following COMISD/UCOMISD.

I gave a try using a Real8 to load the data and came up with  this:
Code: [Select]
[SSEData2: R$ 17.656]
[SSEData3: R$ 25.656]

    cvtsd2ss xmm0 X$SSEData2
    cvtsd2ss xmm1 X$SSEData3
    cmppd xmm0 xmm1 1
    movmskpd eax xmm0
    Test_If eax 1 ; "Test_if is a RosAsm macro. Unrolled is the same as "test eax 1 | jne L2> ; do this L2:"
        ; do this
    Test_End

The comparision was ok.

Then i tried a variation of this converting the Real4 to Real8 like this:

Code: [Select]
[SSEData2b: F$ 17.656] ; Real4
[SSEData3b: F$ 25.656] ; Real4

     cvtss2sd xmm0 X$SSEData2b
    cvtss2sd xmm1 X$SSEData3b
    cmppd xmm0 xmm1 1
    movmskpd eax xmm0
    Test_If eax 1
;         mov eax eax
    Test_End

Also was ok. I tried to convert from Real8 to Real4 using cvtpd2ps and it was ok, too.

But, when i tried to do both in sequence, it occured an error

Code: [Select]
    ; A variante using movq to load the data.
    movq xmm0 X$SSEData2
    movq xmm1 X$SSEData3
    cmppd xmm0 xmm1 1
    movmskpd eax xmm0
    Test_If eax 1 ; Comparision was Ok, it go to the next line
;         do this
    Test_End

    cvtsd2ss xmm0 X$SSEData2
    cvtsd2ss xmm1 X$SSEData3
    cmppd xmm0 xmm1 1
    movmskpd eax xmm0
    Test_If eax 1 ; here the error happened. It jumped over the test !
         ; do this.
    Test_End

The error was because the 1st cmppd produced a QNAN on the Loquadword of xmm0, resulting in "0 QNAN" when visualizing the Packed double on RosAsm debugger.

And when i tried to compare again, the 2nd value to be compared in the xmm0 contained the QNAN, causing eax to return 0, rather then 1, right ?

So, the soluton is always clear xmm0 and xmm1 before loading them to those registers as:

Code: [Select]
    xorps xmm0 xmm0 ; clear registers to avoid errors
    xorps xmm1 xmm1

    movq xmm0 X$SSEData2
    movq xmm1 X$SSEData3
    cmppd xmm0 xmm1 1
    movmskpd eax xmm0
    Test_If eax 1
;         Do this
    Test_End

    xorps xmm0 xmm0 ; clear registers to avoid errors
    xorps xmm1 xmm1

    cvtsd2ss xmm0 X$SSEData2
    cvtsd2ss xmm1 X$SSEData3
    cmppd xmm0 xmm1 1
    movmskpd eax xmm0
    Test_If eax 1
;         Do this
    Test_End

But, is there a equivalent to fcompp in SSE2 to avoid having to use 2 extra opcodes to clear xmm0 and xmm1 before being used ?

I´m asking this, in order to try to create some easier to read macro using this SSE2 opcodes for branching prediction. JJ showed how to use comisd also for comparisons similar to these ones.

But...can SSE2 compare 2 or 4 data in xmm0 (lo and Hi quadword) at the same time ?

I tried this, but didn´t understood the results in eax

Code: [Select]
[SSEData2a: R$ 8, 7]
[SSEData3a: R$ 2, 9]

    xorps xmm0 xmm0 ; clear registers to avoid errors
    xorps xmm1 xmm1

    cvtsd2ss xmm0 X$SSEData2
    cvtsd2ss xmm1 X$SSEData3
    cmppd xmm0 xmm1 1
    movmskpd eax xmm0
    Test_If eax 1
;         mov eax eax
    Test_End

It is comparing 8 with 2 and 7 with 9 at the same time, right ? So, what should be the expected values in eax ?

I mean, so, i could make the routines do 2 things at the same time according to the result in eax.
Ex:
if eax = 00_01 (binary) ; It means that 8 < 2
; do this
If eax = 00_10 ; it means that 7 > 2
; do that
If eax =00_11 ; both are smaller
; so this and that at

Is there a way to make those conditional branch at once  ?
Title: Re: Fast Compare Real8 with SSE
Post by: Siekmanski on February 27, 2019, 02:56:21 AM
For 2 real8 you have 2 sign states in eax witch gives you 4 possibilities ( movmskpd )
For 4 real4 you have 4 sign states in eax witch gives you 16 possibilities ( movmskps )

This gives you also the possibility to use a jump table to execute code without branching.
For RGB conversion you will be able to handle 3 sign bits ( 8 possibilities ).

So, 1 compare makes it possible to execute the code at once without branching.

For example this piece of pseudo code (you can see I'm learning xyz2lab  :biggrin: ) where you can compare red green and blue at once and jump to 1 of the 6 possibilities.

   XR = 95.047
   YG = 100.000
   ZB = 108.883

   x = x / XR
   y = y / YG
   z = z / ZB

   if (x > 0.008856) x = pow(x, 1.0 / 3.0)
   else x = (x * 7.787) + (16.0 / 116.0)

   if (y > 0.008856) y = pow(y, 1.0 / 3.0)
   else y = (y * 7.787) + (16.0 / 116.0)

   if (z > 0.008856) z = pow(z, 1.0 / 3.0)
   else z = (z * 7.787) + (16.0 / 116.0)

Title: Re: Fast Compare Real8 with SSE
Post by: jj2007 on February 27, 2019, 03:49:33 AM
JJ. Your fcmp macro is actually a function, right ? movups loads 4 Real4 at once, is that it ?

A macro combined with a function that expects two numbers on the FPU.

  SetFloat xmm0=1234567890.1234567
  SetFloat MyR4=1234567890.1234567      ; MyR4 is a REAL4
  deb 4, "Comparing two floats:", MyR4, f:xmm0
  Fcmp MyR4, f:xmm0, high      ; compare the two with high precision (top would be strict, medium more tolerant)

This creates the following under the hood:
Code: [Select]
0040143F      ³.  DDC7              ffree st(7)                          ; float 1234567890.1234567000
00401441      ³.  83EC 08           sub esp, 8                           ; create a REAL8 slot
00401444      ³.  0F130424          movlps [esp], xmm0                   ; load xmm0 into slot
00401448      ³.  DD0424            fld qword ptr [esp]                  ; push value on FPU
0040144B      ³.  83C4 08           add esp, 8                           ; correct stack
0040144E      ³.  DDC7              ffree st(7)
00401450      ³.  D943 8A           fld dword ptr [ebx-76]               ; MyR4, the REAL4
00401453      ³.  6A 2D             push 2D                              ; requested precision
00401455      ³.  E8 46100000       call MbFloatCmp                      ; ÀFcmp.MbFloatCmp
0040145A      ³. 75 2B             jnz short C0004
Title: Re: Fast Compare Real8 with SSE
Post by: guga on February 27, 2019, 03:55:29 AM
Quote
For 2 real8 you have 2 sign states in eax witch gives you 4 possibilities ( movmskpd )
For 4 real4 you have 4 sign states in eax witch gives you 16 possibilities ( movmskps )

This gives you also the possibility to use a jump table to execute code without branching.
For RGB conversion you will be able to handle 3 sign bits ( 8 possibilities ).

So, 1 compare makes it possible to execute the code at once without branching.

Tks a lot, Marinus. I´ll give a try this afternoon after reading the intel manual about the basics of SSE.


Quote
For example this piece of pseudo code (you can see I'm learning xyz2lab  :biggrin: ) where you can compare red green and blue at once and jump to 1 of the 6 possibilities.
:greensml: :greensml: :greensml: :greensml: :greensml:
It´s cool, right ? :icon_mrgreen: :icon_mrgreen:

Oh...just keep in mind that the equation in CIE to convert XYZ to Lab is incorrect. Bruce made the initial fix and i fixed the rest.

All you have to do is multiply immediately after it checks the threshold at:

Quote
  if (x > 0.008856) x = pow(x, 1.0 / 3.0)
   else x = (x * 7.787) + (16.0 / 116.0)

The resultant x value must be multiplied with this:
(http://i68.tinypic.com/2j1ahig.png)

or...on a more simplified  way. Multiply x with:
(http://i63.tinypic.com/2ls9kp3.png)

probably it will result on a greenish image if you don´t adjust "a" and "b' to also stays withinh the range of luma, but this is  (in fact) what the CieLab equations are actually doing.

It´s only the final "x" that is incorrect. This fix, will also works once you convert it to CieLCH as well.
Title: Re: Fast Compare Real8 with SSE
Post by: guga on February 27, 2019, 04:27:15 AM
JJ. Your fcmp macro is actually a function, right ? movups loads 4 Real4 at once, is that it ?

A macro combined with a function that expects two numbers on the FPU.

  SetFloat xmm0=1234567890.1234567
  SetFloat MyR4=1234567890.1234567      ; MyR4 is a REAL4
  deb 4, "Comparing two floats:", MyR4, f:xmm0
  Fcmp MyR4, f:xmm0, high      ; compare the two with high precision (top would be strict, medium more tolerant)

This creates the following under the hood:
Code: [Select]
0040143F      ³.  DDC7              ffree st(7)                          ; float 1234567890.1234567000
00401441      ³.  83EC 08           sub esp, 8                           ; create a REAL8 slot
00401444      ³.  0F130424          movlps [esp], xmm0                   ; load xmm0 into slot
00401448      ³.  DD0424            fld qword ptr [esp]                  ; push value on FPU
0040144B      ³.  83C4 08           add esp, 8                           ; correct stack
0040144E      ³.  DDC7              ffree st(7)
00401450      ³.  D943 8A           fld dword ptr [ebx-76]               ; MyR4, the REAL4
00401453      ³.  6A 2D             push 2D                              ; requested precision
00401455      ³.  E8 46100000       call MbFloatCmp                      ; ÀFcmp.MbFloatCmp
0040145A      ³. 75 2B             jnz short C0004

Great work. :t :t :t :t

At the end what it is doing is comparing the values in FPU, rather then SSE, right ?

I was thinking in something similar to the one as in the "If" (Same as in the masm version) or "Fpu_If" (In rosasm.I don´t know if there is a similar one for masm) Macro using the SSE instructions more directly, similar to what you did with the comisd instruction.
Title: Re: Fast Compare Real8 with SSE
Post by: jj2007 on February 27, 2019, 05:36:26 AM
Great work. :t :t :t :t

Thanks :bgrin:

Quote
At the end what it is doing is comparing the values in FPU, rather then SSE, right ?

Yes, more or less. There was a long thread discussing various options some years ago (http://www.masmforum.com/board/index.php?topic=12998.0).

If you want the real fun, try this:

include \masm32\MasmBasic\MasmBasic.inc         ; download (http://masm32.com/board/index.php?topic=94.0)
  SetGlobals temp:REAL10
  Init
  SetFloat temp= 123.45678901234567891   ; temp is a REAL10
  SetFloat ST(0)=123.45678901234567890
  deb 4, "Comparing two floats:", ST(0), temp
  Fcmp temp, ST(0), xtra
 .if Zero?
        Print Str$("%Jf==", ST(0)), Str$(temp)
 .elseif FcmpLess
        PrintLine Str$("temp %Jf", temp), Chr$(60), Str$(ST(0))
 .else
        PrintLine Str$("temp %Jf", temp), Chr$(62), Str$(ST(0))
 .endif
  Inkey CrLf$, "comparing floats is fun!!!!"
EndOfCode


Output:
Code: [Select]
Comparing two floats:
ST(0)           123.4567890123456789
temp            123.4567890123456789
temp 123.4567890123456789>123.4567890123456789
Title: Re: Fast Compare Real8 with SSE
Post by: guga on February 27, 2019, 07:39:07 AM
Good news :)

1st tests when converting Luma to interger was ok. And as a result of the convertion, when i compensate Luma to be adjusted to Chroma, no more errors whatsoever on extrapolations of the sqrt(29) limit.  It seems that the compensation ratio adjusts Chroma and Hue on such a way that it will always be within their boundaries.

I´ll make a minor test on the backward function (it do needs to calculate the reversed compensation ratio, but...no problem. It won´t affect performance. It will be needed to check the limits anyway, so, a pointer to another member of the structure won´t make any significant  difference in terms of speed, and any eventual loss of speed can be compensated later when i start to give a try and convert it to SSE and Real4 to be used in the matrix. I just hope we won´t loose accuracy after converting all variables to Real4 rather then Real8)

The errors on the algorithm when show up are a kind of hide and seek play :greensml: :greensml: :greensml:


One question. How to calculate the probability or density of a certain range of data ?

THis is to analyse what CieLCH is doing with the pixels. I mean, after the fixes, i suceeded to map all 16 million pixels on another table from 0 to 255 (also biased on the luma), and figure it out that the distribution is almost proportional (or making a sine curve, perhaps ?) all over the table.

For example:
Gray/Luma = 0, total pixels = 38
Gray/Luma = 1, total pixels = 250
Gray/Luma = 2, total pixels = 330
....
Gray/Luma = 142, total pixels = 128694
after that it starts decreasing again.
Gray/Luma = 143, total pixels = 108965
Gray/Luma = 144, total pixels = 107555
...
Gray/Luma = 255, total pixels = 0

It seems that i suceeded to adjust CieLab equations to behave as a parabolic cylinder, but the midpoint seems to be at 142 and the decresing values does not match to their opposite side.

So, how to "equalize" the distribution of a range of data in order to it results in something like this:

Gray/Luma = 0, total pixels = 38
Gray/Luma = 1, total pixels = 250
Gray/Luma = 2, total pixels = 330
....
Gray/Luma = 126, total pixels = 100000
Gray/Luma = 127, total pixels = 128694 ; midpoint
Gray/Luma = 128, total pixels = 128694 ; midpoint
after that it starts decreasing again.
Gray/Luma = 129, total pixels = 100000
...

Gray/Luma = 253, total pixels = 330
Gray/Luma = 254, total pixels = 250
Gray/Luma = 255, total pixels = 38

Title: Re: Fast Compare Real8 with SSE
Post by: guga on February 27, 2019, 11:28:18 AM
1st result using Integer as input and output as Luma settled at 49% :t

The fix on the limits for Chroma and Hue shift are not done yet. I´m currently working on it

SheHulk is looking better :greensml: :greensml: :greensml: :greensml: :greensml: :greensml: :greensml:
Title: Re: Fast Compare Real8 with SSE
Post by: Siekmanski on February 27, 2019, 11:47:52 AM
 :t

I've made her an avatar from Pandora  :biggrin:

(http://members.home.nl/siekmanski/she_hulk.png)
Title: Re: Fast Compare Real8 with SSE
Post by: guga on February 27, 2019, 11:59:45 AM
 :greenclp: :greenclp: :greenclp: :greenclp: :greenclp:
Title: Re: Fast Compare Real8 with SSE
Post by: guga on February 27, 2019, 07:50:01 PM
SheHulk is getting mature, now :greensml: :greensml: :greensml: :greensml:


(http://i63.tinypic.com/2uqdfcz.png)

Hi guys, i fixed Luma and Chroma. Now Luma is a  integer from 0 to 255 and Chroma (Which, really is a intensity of Luma, as i mentioned before) i settled it as a Real8 number from 0 to 100. The hue shift i´ll solve the same way i did for Chroma. It is needed to it stay also within the boundaries of Luma inside the LUT. There is a minor, tiny loss of chroma (around 2%) due to the shifting of Hue yet.

On this test, i reduced Luma to 49% and kept chromacity percentage untouched. Once i fix the hue shifting problem, i´ll see the best strategy to link both (Hue and chroma) and start creating some flags to be used during the output  (CieLCH to RGB)
Title: Re: Fast Compare Real8 with SSE
Post by: Siekmanski on February 27, 2019, 08:09:12 PM
 :t
Title: Re: Fast Compare Real8 with SSE
Post by: FORTRANS on February 28, 2019, 12:55:31 AM
Hi,

So, how to "equalize" the distribution of a range of data in order to it results in something like this:

Gray/Luma = 0, total pixels = 38
Gray/Luma = 1, total pixels = 250
Gray/Luma = 2, total pixels = 330
....
Gray/Luma = 126, total pixels = 100000
Gray/Luma = 127, total pixels = 128694 ; midpoint
Gray/Luma = 128, total pixels = 128694 ; midpoint
after that it starts decreasing again.
Gray/Luma = 129, total pixels = 100000
...

Gray/Luma = 253, total pixels = 330
Gray/Luma = 254, total pixels = 250
Gray/Luma = 255, total pixels = 38

   Well I did this on gray scale images.  I made a quick look, but
I did not find my code for a Gaussian distribution.  What I did
was read in the pixel data, assigned a number corresponding to
its position in the picture, and then sorted the data.  I then
replaced the data with values that followed the desired distribution
and unsorted the data back to reform the image.

   A brute force method, but easy to program.  A better way is
to count up the data values and apply an algorithm to replace
an input data value with a value that will create the desired
distribution.  A somewhat harder programming job, as a given
input value may need to map to multiple output values.  So
you may actually have to think a bit when writing your program.

   I did find an example with a uniform output distribution that
can show that result if wanted.

Cheers,

Steve N.
Title: Re: Fast Compare Real8 with SSE
Post by: guga on February 28, 2019, 03:16:53 AM
Tks, Steve

Quote
   Well I did this on gray scale images.  I made a quick look, but
I did not find my code for a Gaussian distribution.  What I did
was read in the pixel data, assigned a number corresponding to
its position in the picture, and then sorted the data.  I then
replaced the data with values that followed the desired distribution
and unsorted the data back to reform the image.

If you find it, can you post for me to read ? I did something like that to equalize the histogram on gray as well, but don´t know if it is ok. I´m trying to understand better how to equalize the whole 16 million of colors rather then an image itself. The distribution of all 16 million colors on the CieLCH using AdobeRGB, after the fixes i made are now more or less equally distributed, but i don´t know exactly how it reached that way. It looks line a sine distribution, but i´m clueless how to create a equation to retrieve the actual math involved with this kind of distribution. I mean, i want to find the proper equation that generated that result (The distribution), fix the math on it (if necessary) in order to it distribute the values/pixels following that equation.

As you see on the attached image, When i reduced luma to 50%, the algorithm generated a similar distribution as the original one (except, reduced to half of it´s pixels.). It is still shifting the hue, since i didn´t made the proper fix yet (hard to follow all those equations, btw and try to develop a strategy to avoid shifting :dazzled: :dazzled:)

Quote
   I did find an example with a uniform output distribution that
can show that result if wanted.

Please, can you upload it or have a link to see this ?


Tks a lot
Title: Re: Fast Compare Real8 with SSE
Post by: FORTRANS on February 28, 2019, 08:12:38 AM
Hi,

   Okay, the last version of the program did uniform weighting and
was written in 2001.  (And still had errors.)  The Gaussian distribution
version was done in 1995.  Its output looks a little rough, see
attached histograms with a maximum bin count.

   Old FORTRAN is sometimes hard to figure out when you don't
remember writing it any more.  I can post some of the code if you
want, but it follows my description above as far as I can see.
And it was for rescaling an image to improve the dynamic range.

Regards,

Steve
Title: Re: Fast Compare Real8 with SSE
Post by: guga on February 28, 2019, 01:22:28 PM
Tks, Steve. Pls post. :icon_cool: :icon_cool: I would like to read to understand it better. :t :t :t

Regards

guga
Title: Re: Fast Compare Real8 with SSE
Post by: FORTRANS on March 01, 2019, 02:23:02 AM
Tks, Steve. Pls post. :icon_cool: :icon_cool: I would like to read to understand it better. :t :t :t

Regards

guga

Hi,

   Okay.  This should be it.  I deleted some debugging code
and a comment was changed to make more sense, sort of.
But I don't think anything meaningful was changed.  Not that
this code is meaningful anyway.

Enjoy,

Steve
Title: Re: Fast Compare Real8 with SSE
Post by: guga on March 01, 2019, 06:00:31 AM
Tks a lot, steve. I´ll take a look :)  :t :t :t
Title: Re: Fast Compare Real8 with SSE
Post by: guga on March 02, 2019, 02:48:25 AM
Found a typo on the property of equations :icon_redface: :icon_redface:

I accidentally switched sin with cos when computing the maximum value of CLFactor (page 9 on the pdf)

The correct formula for a maximum CLFactor is, in fact, given by:

Code: [Select]
5*sin(Hue) -2*cos(Hue) <= CLFactor
So, Hue Maximum is given by:

Code: [Select]
HueMax = arctan(2/5) + arcsin (-CLFactor*sqrt(29)/29))

On the review i´m currently making, it won´t change the limits for X that cannot be bigger then 1-2/sqrt(29) but i was forced to try to see the true limits for Hue all over again in order to try to find a common formula to fix the X Axis.


The Hue angle in any means must not exceed 90º and now i came up with 2 real limits for X

YFinal > XFinal

or

YFinal < XFinal


Which one to choose, seems to be related to the White Reference for X.

When White Reference for X is smaller then White Reference for Y, then we must force X to be smaller then Y

and

When White Reference for X is bigger then White Reference for Y, then we must force X to be bigger then Y

The problem is when both are equal. (I´ll see how to handle this too)

Also, X have other limitation that can never be bigger then  1-2/sqrt(29)

I´m thinking in how to overcome all of this. I could not found a equation for X yet, but at least, found it´s true boundaries. I´m only thinking in how to put all of this together on a way that it can also works on the hundreds of matrices that can be generated from the different colorspaces.

Another thing is that the Threshold computed from X, Y and Z like:

Code: [Select]
If  X > (216/24389)
    XFinal = X(1/3)
Else
     XFinal = X*(841/108) + (16/116)
Endif

Seems to be related to the slope calculated in  the gamma fixes. I did not tried to make this Threshold obey the slope yet, because i found a bug on the property of the equations as mentioned. So, i´ll try to fix this and later see if i can really force the threshold to obey the gamma/slope limits.


Thinking :dazzled: :dazzled: :dazzled:

Damn equations :greensml: :greensml:...It´s killing :dazzled: :dazzled: :dazzled: Why, why, why, CIE didn´t made the proper fixes for all of this things yet ? They do have better conditions to make the proper fix than i do. :greensml: :greensml:
Title: Re: Fast Compare Real8 with SSE
Post by: Siekmanski on March 02, 2019, 05:32:25 AM
Just ask them to fix it.  :biggrin: :biggrin:
Title: Re: Fast Compare Real8 with SSE
Post by: guga on March 02, 2019, 05:42:13 AM
 :icon_mrgreen: :icon_mrgreen: :icon_mrgreen: :icon_mrgreen: :greensml: :greensml:
Title: Re: Fast Compare Real8 with SSE
Post by: guga on March 05, 2019, 11:29:27 PM
Ok, guys. Trying another correction for the equations. I´ll make some tests today and show the results. Let´s hope it is finally fixed. The bad thing on all of this is, when you think you fixed one part, the other one has broken :dazzled: :dazzled: Hard to find a common way to fix all those equations. Hue angle simply refuses to stay in 90º range on all color models
Title: Re: Fast Compare Real8 with SSE
Post by: daydreamer on March 06, 2019, 06:36:04 AM
great work Guga :t
I am doing some work that maybe can be useful
http://masm32.com/board/index.php?topic=7708.msg84620#msg84620 (http://masm32.com/board/index.php?topic=7708.msg84620#msg84620)
Title: Re: Fast Compare Real8 with SSE
Post by: guga on March 06, 2019, 08:04:13 AM
Tks DayDreamer.

I should post an example tonight. I succeeded to make the backward function works but, since i changed the properties of the equations, i´m having to change also the way it represents Chroma and Hue as percentages. Need only to adapt the percentage to the new equation and find a way to link Hue and Chroma on the backward computation, in case you want, for example to decrease Luma with a given percentage that automatically can correct the ratio between chroma/hue with and without shifting (To avoid excessive greenish or reddish the image).

Btw. Can you try to make a fast asin, atan2 and acos, sin and cos and tan and sqrt using Chebyshev method ? It will be helpfull to optimize. In Real8 and Real4, please. So i can test both. I made the functions completely in Real8, so using the asin/acos;atan etc etc in real8 will be great to test this rigth away. And later, if everything works fine., i´ll try porting everything to Real4 (Including the matrices etc) and test a version with Real4 too in order to see the true differences among them before giving a try in the SSE optimization routines suggested by marinus and JJ.

Title: Re: Fast Compare Real8 with SSE
Post by: guga on March 06, 2019, 07:47:18 PM
1st Test of the corrected CieLCh equations. Luma is a integer value from 0 to 255 (Here on this example, i settled it as 0 to 100%), Chroma and Hue works on 0 - 100% of range per luma.

Needing 'only' to find out How to properly link Chroma and Hue, but the equations and their properties are fixed (i hope).  Curiosuly, the algorithm is colorizing a grayscale image at some extent. (Not recovering the original  colors, but creating new colors from the ones it already have)

Source ccode is embeded in the app. So, to read it, you need to open the file in RosAsm. Btw: Sorry for the mess up with the code. About the statuss bars, it have some minor issues when scrolling the bars, but, i´ll fix that later.
Title: Re: Fast Compare Real8 with SSE
Post by: guga on March 07, 2019, 12:36:48 AM
Fixed minor problem with the scrollbar and  attached the dll that i forgot to include yesterday.

(except freeimage.dll which i uploaded on the link below due to it´size).
https://drive.google.com/open?id=1mG85imZqFt_19YoDjPEaizgL0ZBTkpvR
Title: Re: Fast Compare Real8 with SSE
Post by: HSE on March 07, 2019, 01:58:50 AM
Hi Guga!

I have problems with your dll, but work with that downloaded from https://sourceforge.net/projects/freeimage/

Title: Re: Fast Compare Real8 with SSE
Post by: guga on March 07, 2019, 02:37:03 AM
Hi Guga!

I have problems with your dll, but work with that downloaded from https://sourceforge.net/projects/freeimage/

Great, many thanks. Mine was the previous version. Downloaded the updated one right now :t :t :t

Later i´ll try another approach on the properties of the equation. The main problem of CieLab/CieLCH is the lack of information of the true correspondence between Hue and Chroma. I created a equation to settle it´s boundaries according to each Luma range, but on the current test, i settled it only with the max/min of the general luma, rather then the max/min on each luma range.

In order to avoid clipping in X, Y, Z i had to compute a ratio between min and max, but used the general maximum and minimum. Next test, i´ll see how it works if i settle the max/min, per luma range and create individual ratios per luma, rather then 2 general ones.

This lack of a clear correspondence between Hue/Chroma or "a"/"b" is killing. "a" and 'b" elements in CieLab were simply taken from their hats, rather then calculating as a spherical coordinates.  I´m testing with different color models and so far, it is working as expected (Without clipping, i mean). A reasonable good results seems to be when using "ECI RGB" color model.
Title: Re: Fast Compare Real8 with SSE
Post by: guga on March 10, 2019, 04:34:13 AM
Testing a new colorspace called HSM (Hue, Saturation, Mixture) created by Oswaldo Severino Jr and Adilson Gonzaga back in 2009, designed to segment colors (such as human skin). The specifications of the  original algorithm was buggy and did not retrieved the backward colors. So, i fixed it to correctly works in RGb to HSM and HSM to RGB. Although it still have some clippings problems (Why people keep forgetting to handle this, is a complete mystery to me :greensml: :greensml: :greensml: ).

Anyway, it is working, and is fast.

Some problems it still have:

1) Does not allow different color models. (I´ll implement it later in order to create a variation of the algorithm that works on all color models)
2) It has clipping problems. (I´ll fix this once i understand how to properly analyse the relation between hue and mixture)
3) Does not handle Luma. "Mixture" is, in fact a vaariation of luma, which the original author tried to represent as layers, but ´ll later see how to correctly fix this, inserting luma without loosing thew characteristics of this "mixture' thing.

Note: Despite those issues on the original algorithm, HSM seems more robust to separate and segment colors and if i suceed to combine it with mine variation of CieLCh, i´ll probably make it more robust than both and a new colorspace will be born :)

Reference info: https://www.researchgate.net/publication/265572436_HSM_A_New_Color_Space_used_in_the_Processing_of_Color_Images

Attached test file. Missing only freeimage.dll due to the size of the attachment.