Fast Compare Real8 with SSE and ColorSpaces

jj2007 · February 27, 2019, 03:49:33 AM

Quote from: guga on February 27, 2019, 01:37:12 AM
JJ. Your fcmp macro is actually a function, right? movups loads 4 Real4 at once, is that it?

A macro combined with a function that expects two numbers on the FPU.

SetFloat xmm0=1234567890.1234567
SetFloat MyR4=1234567890.1234567 ; MyR4 is a REAL4
deb 4, "Comparing two floats:", MyR4, f:xmm0
Fcmp MyR4, f:xmm0, high ; compare the two with high precision (top would be strict, medium more tolerant)

This creates the following under the hood:

0040143F  DDC7          ffree st(7)                ; float 1234567890.1234567000
00401441  83EC 08       sub esp, 8                 ; create a REAL8 slot
00401444  0F130424      movlps [esp], xmm0         ; store low 64 bits of xmm0 into slot
00401448  DD0424        fld qword ptr [esp]        ; push value on FPU
0040144B  83C4 08       add esp, 8                 ; correct stack
0040144E  DDC7          ffree st(7)
00401450  D943 8A       fld dword ptr [ebx-76]     ; MyR4, the REAL4
00401453  6A 2D         push 2D                    ; requested precision
00401455  E8 46100000   call MbFloatCmp            ; Fcmp.MbFloatCmp
0040145A  75 2B         jnz short C0004
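
To make the idea of a precision-graded compare concrete, here is a rough scalar model in Python. The thread does not show what MbFloatCmp does internally, so the relative-tolerance scheme and the three tolerance values below are assumptions chosen only for illustration; `to_real4` and `fcmp` are hypothetical helper names.

```python
import struct

def to_real4(x):
    """Round a REAL8 (Python float) to the nearest REAL4 and widen it again."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

# Hypothetical tolerance levels; the values Fcmp really uses are not given in the thread.
TOLERANCES = {"top": 0.0, "high": 1e-7, "medium": 1e-5}

def fcmp(a, b, precision="high"):
    """Return 0 if a and b agree within the relative tolerance, else -1 or 1."""
    tol = TOLERANCES[precision]
    scale = max(abs(a), abs(b))
    if abs(a - b) <= tol * scale:
        return 0
    return -1 if a < b else 1

r8 = 1234567890.1234567   # what xmm0 holds as a REAL8
r4 = to_real4(r8)         # what MyR4 holds after rounding to REAL4

print(r4)                      # the REAL4 cannot keep all the digits
print(fcmp(r4, r8, "top"))     # strict compare sees the rounding error
print(fcmp(r4, r8, "high"))    # tolerant compare treats them as equal
```

In this model the REAL4 copy differs from the REAL8 by roughly 4 parts in 10^8, so a strict compare reports a difference while the more tolerant levels report equality.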

guga · February 27, 2019, 03:55:29 AM

Quote
For 2 real8 you have 2 sign states in eax, which gives you 4 possibilities ( movmskpd )
For 4 real4 you have 4 sign states in eax, which gives you 16 possibilities ( movmskps )

This gives you also the possibility to use a jump table to execute code without branching.
For RGB conversion you will be able to handle 3 sign bits ( 8 possibilities ).

So, 1 compare makes it possible to execute the code at once without branching.

Tks a lot, Marinus. I'll give it a try this afternoon after reading the Intel manual about the basics of SSE.
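
The sign-mask idea quoted above can be modelled in a few lines of Python. This is only a scalar sketch of what movmskps delivers (one sign bit per lane, lane 0 in bit 0) followed by a table lookup; the names `movmskps_model`, `JUMP_TABLE` and `dispatch` are made up for the illustration, and the handlers just report which channels fell below the threshold.

```python
import math

def movmskps_model(lanes):
    """Collect the sign bit of each lane into an integer, lane 0 in bit 0."""
    mask = 0
    for i, value in enumerate(lanes):
        if math.copysign(1.0, value) < 0.0:   # sign bit set (also true for -0.0)
            mask |= 1 << i
    return mask

def make_handler(mask):
    """Build the handler for one of the 8 sign-bit combinations."""
    def handler(rgb):
        below = [name for bit, name in enumerate("RGB") if (mask >> bit) & 1]
        return "below threshold: " + ("".join(below) or "none")
    return handler

# One entry per possible 3-bit mask, so the mask itself selects the code to run.
JUMP_TABLE = [make_handler(m) for m in range(8)]

def dispatch(rgb, threshold):
    # Subtracting the threshold leaves a negative lane wherever the channel is below it.
    diffs = [c - threshold for c in rgb]
    mask = movmskps_model(diffs)
    return mask, JUMP_TABLE[mask](rgb)

print(dispatch((0.1, 0.9, 0.2), 0.5))
```

In assembly the table would hold code addresses and the mask in eax would index it directly, so the three channels need one compare and one indirect jump instead of three conditional branches.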

Quote
For example, this piece of pseudo code (you can see I'm learning xyz2lab) where you can compare red, green and blue at once and jump to 1 of the 6 possibilities.

It´s cool, right ? :icon_mrgreen: :icon_mrgreen:

Oh...just keep in mind that the equation in CIE to convert XYZ to Lab is incorrect. Bruce made the initial fix and I fixed the rest.

All you have to do is multiply immediately after it checks the threshold at:

Quote
if (x > 0.008856) x = pow(x, 1.0 / 3.0)
else x = (x * 7.787) + (16.0 / 116.0)

The resultant x value must be multiplied with this:

or...on a more simplified way. Multiply x with:

Probably it will result in a greenish image if you don't adjust "a" and "b" to also stay within the range of luma, but this is (in fact) what the CieLab equations are actually doing.

It's only the final "x" that is incorrect. This fix will also work once you convert it to CieLCH as well.
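
For reference, the threshold step quoted in this post looks like this in a runnable Python sketch of the textbook XYZ to Lab conversion. It covers the standard form only: the extra multiplication described above is not reproduced, because the factor itself did not survive in the thread. The D65 white point and the function names `lab_f` and `xyz_to_lab` are assumptions for the example.

```python
def lab_f(t):
    """Threshold step from the quoted pseudo code: cube root above 0.008856, linear below."""
    if t > 0.008856:
        return t ** (1.0 / 3.0)
    return t * 7.787 + 16.0 / 116.0

def xyz_to_lab(x, y, z, white=(0.95047, 1.0, 1.08883)):
    """Standard CIE XYZ to L*a*b*, with XYZ scaled so that white has Y = 1 (D65 assumed)."""
    fx, fy, fz = (lab_f(v / w) for v, w in zip((x, y, z), white))
    lightness = 116.0 * fy - 16.0
    a = 500.0 * (fx - fy)
    b = 200.0 * (fy - fz)
    return lightness, a, b

print(xyz_to_lab(0.95047, 1.0, 1.08883))   # the white point itself
print(xyz_to_lab(0.0, 0.0, 0.0))           # black
```

The two branches of `lab_f` meet at the threshold to within about 1e-5, so the conversion has no visible step there.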

guga · February 27, 2019, 04:27:15 AM

Quote from: jj2007 on February 27, 2019, 03:49:33 AM

Great work. :t :t :t :t

In the end, what it is doing is comparing the values in the FPU rather than in SSE, right?

I was thinking of something similar to the "If" macro (same as in the masm version) or the "Fpu_If" macro (in RosAsm; I don't know if there is a similar one for masm), but using the SSE instructions more directly, similar to what you did with the comisd instruction.

jj2007 · February 27, 2019, 05:36:26 AM

Quote from: guga on February 27, 2019, 04:27:15 AM
Great work. :t :t :t :t

Thanks

Quote
At the end what it is doing is comparing the values in FPU, rather than SSE, right?

Yes, more or less. There was a long thread discussing various options some years ago.

If you want the real fun, try this:

include \masm32\MasmBasic\MasmBasic.inc ; download
SetGlobals temp:REAL10
Init
SetFloat temp= 123.45678901234567891 ; temp is a REAL10
SetFloat ST(0)=123.45678901234567890
deb 4, "Comparing two floats:", ST(0), temp
Fcmp temp, ST(0), xtra
.if Zero?
Print Str$("%Jf==", ST(0)), Str$(temp)
.elseif FcmpLess
PrintLine Str$("temp %Jf", temp), Chr$(60), Str$(ST(0))
.else
PrintLine Str$("temp %Jf", temp), Chr$(62), Str$(ST(0))
.endif
Inkey CrLf$, "comparing floats is fun!!!!"
EndOfCode

Output:


Comparing two floats:
ST(0)           123.4567890123456789
temp            123.4567890123456789
temp 123.4567890123456789>123.4567890123456789

guga · February 27, 2019, 07:39:07 AM

Good news :)

The first tests converting Luma to integer were OK. And as a result of the conversion, when I compensate Luma to be adjusted to Chroma, there are no more errors whatsoever from extrapolating past the sqrt(29) limit. It seems that the compensation ratio adjusts Chroma and Hue in such a way that they will always stay within their boundaries.

I'll make a minor test on the backward function. It does need to calculate the reversed compensation ratio, but that's no problem; it won't affect performance. The limits need to be checked anyway, so a pointer to another member of the structure won't make any significant difference in speed, and any eventual loss of speed can be compensated later when I try converting it to SSE and Real4 for use in the matrix. I just hope we won't lose accuracy after converting all variables to Real4 rather than Real8.

The errors in the algorithm, when they show up, are a kind of hide-and-seek game.

One question: how do I calculate the probability or density of a certain range of data?

This is to analyse what CieLCH is doing with the pixels. I mean, after the fixes, I succeeded in mapping all 16 million pixels onto another table from 0 to 255 (also biased on the luma), and figured out that the distribution is almost proportional (or makes a sine curve, perhaps?) all over the table.

For example:
Gray/Luma = 0, total pixels = 38
Gray/Luma = 1, total pixels = 250
Gray/Luma = 2, total pixels = 330
....
Gray/Luma = 142, total pixels = 128694
after that it starts decreasing again.
Gray/Luma = 143, total pixels = 108965
Gray/Luma = 144, total pixels = 107555
...
Gray/Luma = 255, total pixels = 0

It seems that I succeeded in adjusting the CieLab equations to behave as a parabolic cylinder, but the midpoint seems to be at 142 and the decreasing values do not match their opposite side.
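
On the probability question: with a table of counts per Gray/Luma level, the probability of a range is just the counts in that range divided by the total. A minimal Python sketch follows; the four-bin `toy` table is made-up data (not the real 256-entry table from this post) and the function names are hypothetical.

```python
def probabilities(counts):
    """Turn raw bin counts into a probability per bin (the values sum to 1)."""
    total = sum(counts)
    return [c / total for c in counts]

def range_probability(counts, lo, hi):
    """Probability that a value falls in bins lo..hi, both ends included."""
    return sum(counts[lo:hi + 1]) / sum(counts)

def cumulative(counts):
    """Running share of pixels at or below each bin."""
    total = sum(counts)
    running = 0
    out = []
    for c in counts:
        running += c
        out.append(running / total)
    return out

def median_bin(counts):
    """First bin where the cumulative share reaches one half."""
    for level, share in enumerate(cumulative(counts)):
        if share >= 0.5:
            return level

toy = [1, 3, 4, 2]                      # toy counts for four gray levels
print(probabilities(toy))
print(range_probability(toy, 1, 2))
print(median_bin(toy))
```

Run on the full 0..255 table, `median_bin` shows where half of the pixels have been accumulated, which is one way to check whether the mass really sits around level 142 rather than at the middle of the range.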

So, how do I "equalize" the distribution of a range of data so that it results in something like this:

Gray/Luma = 0, total pixels = 38
Gray/Luma = 1, total pixels = 250
Gray/Luma = 2, total pixels = 330
....
Gray/Luma = 126, total pixels = 100000
Gray/Luma = 127, total pixels = 128694 ; midpoint
Gray/Luma = 128, total pixels = 128694 ; midpoint
after that it starts decreasing again.
Gray/Luma = 129, total pixels = 100000
...

Gray/Luma = 253, total pixels = 330
Gray/Luma = 254, total pixels = 250
Gray/Luma = 255, total pixels = 38

guga · February 27, 2019, 11:28:18 AM

1st result using Integer as input and output as Luma settled at 49% :t

The fix on the limits for Chroma and Hue shift is not done yet. I'm currently working on it.

SheHulk is looking better

Siekmanski · February 27, 2019, 11:47:52 AM

:t

I've made her an avatar from Pandora

guga · February 27, 2019, 11:59:45 AM

guga · February 27, 2019, 07:50:01 PM

SheHulk is getting mature, now

Hi guys, I fixed Luma and Chroma. Now Luma is an integer from 0 to 255, and Chroma (which really is an intensity of Luma, as I mentioned before) is set as a Real8 number from 0 to 100. I'll solve the hue shift the same way I did for Chroma: it also needs to stay within the boundaries of Luma inside the LUT. There is still a tiny loss of chroma (around 2%) due to the shifting of Hue.

In this test, I reduced Luma to 49% and kept the chromaticity percentage untouched. Once I fix the hue shifting problem, I'll look for the best strategy to link both (Hue and Chroma) and start creating some flags to be used during the output (CieLCH to RGB).

Siekmanski · February 27, 2019, 08:09:12 PM

:t

FORTRANS · February 28, 2019, 12:55:31 AM

Hi,

Quote from: guga on February 27, 2019, 07:39:07 AM
So, how do I "equalize" the distribution of a range of data so that it results in something like this:

Gray/Luma = 0, total pixels = 38
Gray/Luma = 1, total pixels = 250
Gray/Luma = 2, total pixels = 330
....
Gray/Luma = 126, total pixels = 100000
Gray/Luma = 127, total pixels = 128694 ; midpoint
Gray/Luma = 128, total pixels = 128694 ; midpoint
after that it starts decreasing again.
Gray/Luma = 129, total pixels = 100000
...

Gray/Luma = 253, total pixels = 330
Gray/Luma = 254, total pixels = 250
Gray/Luma = 255, total pixels = 38

Well I did this on gray scale images. I made a quick look, but
I did not find my code for a Gaussian distribution. What I did
was read in the pixel data, assigned a number corresponding to
its position in the picture, and then sorted the data. I then
replaced the data with values that followed the desired distribution
and unsorted the data back to reform the image.

A brute force method, but easy to program. A better way is
to count up the data values and apply an algorithm to replace
an input data value with a value that will create the desired
distribution. A somewhat harder programming job, as a given
input value may need to map to multiple output values. So
you may actually have to think a bit when writing your program.
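
The brute-force method described above (tag each pixel with its position, sort, overwrite with the desired distribution, unsort) can be sketched in Python as follows. This is not the original FORTRAN, only an illustration of the description; `match_histogram_by_sorting` and its `target_counts` parameter are hypothetical names.

```python
def match_histogram_by_sorting(pixels, target_counts):
    """Reshape the histogram of `pixels` into `target_counts`.

    target_counts[v] is how many pixels should end up with value v;
    the counts must add up to the number of pixels.
    """
    if sum(target_counts) != len(pixels):
        raise ValueError("target counts must add up to the number of pixels")

    # Sort the positions by pixel value: the position is the tag that lets us unsort.
    order = sorted(range(len(pixels)), key=lambda i: pixels[i])

    # The desired distribution written out as a sorted list of output values.
    target_values = [v for v, n in enumerate(target_counts) for _ in range(n)]

    # The darkest input pixel gets the darkest target value, and so on;
    # writing through `order` puts every value back at its original position.
    out = [0] * len(pixels)
    for rank, pos in enumerate(order):
        out[pos] = target_values[rank]
    return out

image = [5, 1, 3, 1, 9, 7]
wanted = [1, 2, 2, 1]          # symmetric target: counts for values 0, 1, 2, 3
print(match_histogram_by_sorting(image, wanted))
```

Note that the two input pixels with value 1 end up with different output values. That is the situation mentioned above where one input value has to map to multiple output values, and the sort handles it by rank without any extra logic.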

I did find an example with a uniform output distribution that
can show that result if wanted.

Cheers,

Steve N.

guga · February 28, 2019, 03:16:53 AM

Tks, Steve

Quote
Well I did this on gray scale images. I made a quick look, but
I did not find my code for a Gaussian distribution. What I did
was read in the pixel data, assigned a number corresponding to
its position in the picture, and then sorted the data. I then
replaced the data with values that followed the desired distribution
and unsorted the data back to reform the image.

If you find it, can you post it for me to read? I did something like that to equalize the histogram on gray as well, but I don't know if it is OK. I'm trying to understand better how to equalize the whole 16 million colors rather than an image itself. The distribution of all 16 million colors in CieLCH using AdobeRGB, after the fixes I made, is now more or less equally distributed, but I don't know exactly how it got that way. It looks like a sine distribution, but I'm clueless about how to create an equation that retrieves the actual math involved in this kind of distribution. I mean, I want to find the proper equation that generated that result (the distribution) and fix the math in it (if necessary) so that it distributes the values/pixels following that equation.

As you can see in the attached image, when I reduced luma to 50%, the algorithm generated a distribution similar to the original one (except reduced to half of its pixels). It is still shifting the hue, since I haven't made the proper fix yet (it is hard to follow all those equations, by the way, and to develop a strategy to avoid the shifting).

Quote
I did find an example with a uniform output distribution that
can show that result if wanted.

Please, can you upload it or have a link to see this ?

Tks a lot

FORTRANS · February 28, 2019, 08:12:38 AM

Hi,

Okay, the last version of the program did uniform weighting and
was written in 2001. (And still had errors.) The Gaussian distribution
version was done in 1995. Its output looks a little rough, see
attached histograms with a maximum bin count.

Old FORTRAN is sometimes hard to figure out when you don't
remember writing it any more. I can post some of the code if you
want, but it follows my description above as far as I can see.
And it was for rescaling an image to improve the dynamic range.

Regards,

Steve

guga · February 28, 2019, 01:22:28 PM

Tks, Steve. Pls post. :icon_cool: :icon_cool: I would like to read to understand it better. :t :t :t

Regards

guga

FORTRANS · March 01, 2019, 02:23:02 AM

Quote from: guga on February 28, 2019, 01:22:28 PM
Tks, Steve. Pls post. :icon_cool: :icon_cool: I would like to read to understand it better. :t :t :t

Regards

guga

Hi,

Okay. This should be it. I deleted some debugging code
and a comment was changed to make more sense, sort of.
But I don't think anything meaningful was changed. Not that
this code is meaningful anyway.

Enjoy,

Steve

The MASM Forum
