Fast Compare Real8 with SSE and ColorSpaces

guga · February 25, 2019, 05:30:39 AM

Hmm... OK. If you think we are not at risk of a cascade effect, we can try Real4 for the output values of Hue.

Btw, I'm close to a fix.

I found this constant, by which x must be multiplied before it is transformed into "a" and "b":

1- [(1/(8.8564516e-3)^1/3)^2/29]*0.628609323645896273706847552307563478876198609541738700849

To stay within the limit of sqrt(29) (and so produce a valid hue), the true value of X should be reduced to (almost) half of itself. That makes sense if we consider CieLCh as a cylinder whose x axis is, in fact, the radius. Probably the error is that "x" in the original CieLCh equation is the diameter and not the radius of the cylinder?

0.4936202640503811484389906431356658372727441795149125

This number generated only 39208 errors out of all 16 million colors. Those few errors were hue values exceeding the sqrt(29) limit, with an error margin of around 1e-3. I'm too damn close to a constant value.

I'm pretty sure this is caused by the maximum Y being a bit smaller than 1. I'll do some more checks before trying this constant in the other color models, to test whether it is really valid.

guga · February 25, 2019, 06:45:06 AM

OK, for the HDTV, C and D65 models (2° observer) the number of errors was small: fewer than 40000 among 16 million colors, all of them with a tiny error margin of around 1e-3.

For the D50 models (ProPhoto, Beta, ColorMatch) there are thousands of errors, but all of them within a small margin of 1e-3, except 1 error of around 1.28.
For CiE_RGB_E there is also a large number of errors, all of them within the small margin (1e-3), except one with an error of 1.26.

Amazingly, NTSC RGB "C" did not produce any errors.

I'm building a small fix using that constant plus an error check for the maximum Y, to see if it decreases the errors even more. I know the error margin is really small, but it is still there.

From all those errors, I'm thinking that CIE in fact made an incorrect analysis of the CieLab/CieLCH color space. The x from XYZ, when transformed to Lab/Hue/Chroma, represents the radius of the cylinder (this color space is a parabolic cylinder) and not the diameter, as was apparently assumed in the original equations. One thing is for sure now: when converting to "a" and "b", x must be reduced to half of itself (almost half, in fact).

One good thing is that Chroma is now restricted to a span of around 100. I mean, the difference between its maximum and minimum is somewhere between 100 and 118 in each luma range. So it is likely that we can use integers as well :)

After fixing this with Y, I'll run the tests to see whether the backward conversion also works this way (let's hope it does).

guga · February 25, 2019, 12:13:13 PM

First tests on the reverse function (CieLCH to RGB).

The structure/texture of the image is preserved, and so is the general light. It needs only minor adjustments for values that still exceed the sqrt(29) limit on the hue.

The "greenish" tint is normal, because I didn't adjust the reverse function to stay within the limits of the LUT, so it is shifting the hue. This was only a test to see if the backward function still recovers the RGB values, with and without adjusting the luma. Remember I said that CIE's claim that Luma is totally independent was incorrect? Luma affects hue (and also chroma). This is why some limits need to be established in the LUT to avoid shifting.

I'm finishing one or two functions to force the X axis ratio to obey the hue limit, and later I'll fix CieLCHtoRGB to stay inside the limits for Luma, Hue and Chroma according to the Ws_Matrix structure (the LUT).

guga · February 26, 2019, 12:42:26 AM

Hi Guys,

I'm starting a couple of fine-tuning adjustments on the equations and will start converting the outputs to integers.

It is better to use integers as output for Luma, Chroma and Hue, right?

From the results I'm seeing so far, I plan to make the integer range the same as for R, G, B. So Luma, instead of being output as a Real8, will be output as an integer value. The same can be done with Chroma. I'm not sure, however, what the best strategy for Hue would be.

For example, once I convert Luma to an integer, I'll be forced to make a minor adjustment to compensate for the potential loss in Chroma values. I chose to make the adjustment in Chroma, since Hue (the thing that actually creates the color) needs to remain unchanged.

In the formula I developed, Hue is simply the sum of a constant (atan(5/2)) and the asin of Luma/Chroma. So, in order to keep Hue intact, and since I'm using a table of Lumas, all I have to do is create a multiplier to be applied to Chroma, e.g. Chroma*k,
where k = the compensation ratio

The value of k can be calculated as:

k = (NextLuma+16)/(PreviousLuma+16)

Where "NextLuma" is the next Luma value in the Ws_Matrix table, and "PreviousLuma" follows the same logic.

Ex: In the Ws_Matrix structure (Our LUT) i have:

Quote
Gray/Luma Index    Luma Value
(...)
100                58.69656
101                59.01234
(...)

So all that is needed is to calculate the compensation ratio, like:
(59.01234+16)/(58.69656+16)
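
As a quick check of the arithmetic above, here is a minimal C sketch of the compensation ratio. The function name `chroma_compensation` is hypothetical (it is not part of the actual RGBtoCieLCH code), and the two Luma values are simply the ones quoted from the Ws_Matrix excerpt.

```c
/* Hypothetical helper: compensation ratio k between two consecutive
   Luma entries of the table, k = (NextLuma + 16) / (PreviousLuma + 16). */
static double chroma_compensation(double next_luma, double previous_luma)
{
    return (next_luma + 16.0) / (previous_luma + 16.0);
}
```

For the entries at index 100 and 101, `chroma_compensation(59.01234, 58.69656)` comes out at roughly 1.0042, and Chroma would then be multiplied by that value.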

And use the resulting value to multiply Chroma in the RGBtoCieLab function. (Probably the multiplication will be needed only in this function and not in its reverse, CieLChtoRGB; after all, it can be presumed that this correction has already been applied, since we are using integers in CieLChtoRGB too.)

The compensation ratio between consecutive Lumas is really small (approximately 1.05 to a max of 1.002), but it is important to avoid cascading errors and to keep the maximum number of pixels (the 16 million colors) available to the other parts of the function and its reverse.

At first, the compensation of Chroma from Luma is the easiest part and won't affect the general performance of RGBtoCieLCH.

About Chroma using integers: it can be done (from 0 to 255, too). However, I won't be able to apply the same compensation as for Luma, since Chroma is a fraction of Luma; doing that compensation for Chroma too would simply turn Luma back into a Real8.

Then i have 2 alternatives.

a) Create a compensation ratio specific to Chroma, which will affect the Hue angle. (It won't affect it too much, but I really don't know the actual results yet. I mean, even if I increase hue in RGBtoCieLCh, in theory this can be reverted in CieLCHtoRGB.)
or
b) Accept the eventual loss of accuracy (in hue) and make Chroma use only 255 values without any compensation.

Question #1 - What do you prefer?

Question #2 - If I also use Hue as an integer, it would be better to create an output parameter in RGBtoCieLCH where the Real8 value of Hue is stored (to display, for example, in an edit box, static box etc.). Sure, Hue will also be used as an index into the LUT, but the user may want to actually see which Hue he is working with. So: add an extra parameter, or keep everything as integers and let the user calculate hue from the index on his own?

Most likely it is possible to add some special flags to the CieLCHtoRGB function, allowing the user to shift the hue over its whole range, or limited to its own luma range, or to link all values together (luma, chroma and hue automatically corrected), or to link only chroma/hue, or even to recolor an image using only the minimum or maximum hue for each luma range, etc. I'm not sure yet what can be added as a flag, since I haven't started to work on this particular part of the backward function (which first needs correction to stay within the limits of the LUT).

daydreamer · February 26, 2019, 05:17:10 AM

congrats on making a shehulk app

Integers vs floats: check Raymond's page about fixed point, a 3rd alternative.
For example, you can have an integer part + fraction part for better precision than 0-255 integers, but still fast.

guga · February 26, 2019, 06:24:54 AM

Quote from: daydreamer on February 26, 2019, 05:17:10 AM
congrats on making a shehulk app

I would marry her, but she´s not mature yet.

Quote
integers vs floats,check raymonds page about fixed points a 3rd alternative
for example you can have integer part+fraction part for better precision than 0-255 integers,but also fast

Here the problem won't be speed, but the strategy itself, to make it easier to find the values in the lookup table. All the floating point data are still there, hidden and precalculated. It won't affect precision (at least as far as luma is concerned, whose steps between table entries are really, really small).
Ex: when luma is in the range of gray = 168 (the index in the lookup table), its value is 72.325013998827, and in the next range (169) it is 72.658443961269242.

We can force luma to always stay at its minimum value and compensate by slightly increasing the Chroma in the same range (which is mainly an intensity of luma). I haven't tested it yet, but it is likely that changing luma won't affect the results. For chroma I'm not so sure, since the range between the previous and next chroma is big (but we can probably use integers so that it always stays between 0 and 255 too).

For example, for chroma we have, in that same luma range:
Index = 168, Chroma Max = 264.299252280022415, Chroma Min = 153.273578770510227.

See? The Chroma range is always around 100 in each luma range. So we can simply create an index and use it as a percentage. For example, the function can be like this:

call CieLCHtoRGB, 150, 75, PointertoHue, OutRed, OutGreen, OutBlue

The 1st parameter is the integer value for luma (the index). So, when we input 150, it will look up the corresponding Real8 value in the LUT and start calculating the rest of the variables.
About chroma: when we pass 75, it can do the same, but it will generate the true chroma value from a percentage (using 0 to 255 instead of 0 to 100, to achieve a bit more accuracy).

Since I can compensate for the difference in Luma (using integers) by increasing Chroma (which, as a consequence, will adjust hue so that it always stays within its limits), maybe I can adjust chroma in the same way.

Each Luma range has its own unique values for Chroma and Hue; that's why I thought of using at least luma and chroma as integers, so we can make them a simple index rather than a float/Real8 value. For example, when chroma = 130, it will necessarily be in another luma range and no longer in 168.

Also, when the Luma range is 168 we have "only" 102966 pixels that fall in this range, distributed over a Hue range from 152.972º to 197.494º. So it is only something around 2000 pixels per hue. If we divide chroma by 255, it will then be only about 9 pixels per chroma/hue. Considering that those 9 pixels are very similar to each other, using integers probably won't affect the general look of the image.

The most that will happen (in theory) is a slight increase in the range of Hue in each luma range, but it won't affect the conversion of the pixels, since the equations are now fixed.

I haven't tested it yet, but it seems valid, and it is a way to use the method proposed by Marinus as well, but using precalculated ratios rather than creating more tables.

Siekmanski · February 26, 2019, 06:56:41 AM

Quote
It is better to we use integers as output for Luma, Chroma and Hue, right ?

Speaking for myself, the most convenient and fastest way is to convert each 32-bit pixel to 4 floats, do your math on them, and when done convert them back to a 32-bit integer pixel to present on screen.

This can be done very fast in SSE2, all 4 ARGB members in one go.
And without shifting them in place. 8)

Code Select


.data
align 16
ARGB_pixels	dd 0ff010203h,0ff040506h,0ff070809h,0ff0a0b0ch

.code

    pxor        xmm6,xmm6                   ; Empty the source operand, to zero the integer high parts,
                                            ; inside the "punpcklbw" and "punpcklwd" instructions
    movdqa      xmm7,oword ptr ARGB_pixels  ; Load 4 ARGB pixels at once

    movq        xmm0,xmm7                   ; Copy the low 64 bits; the unpacks below keep only the first ARGB pixel
    punpcklbw   xmm0,xmm6                   ; Convert 4 bytes to 4 words
    punpcklwd   xmm0,xmm6                   ; Convert 4 words to 4 dwords
    cvtdq2ps    xmm0,xmm0                   ; Convert 4 dwords to 4 real4 values

; Do all floating point calculations here
; when done, convert 4 real4 values at once to a 32 bit ARGB pixel

    cvtps2dq   	xmm0,xmm0                   ; Convert 4 real4 values to 4 dwords
    packssdw    xmm0,xmm0                   ; Convert 4 dwords to 4 words
    packuswb    xmm0,xmm0                   ; Convert 4 words to 4 bytes 
    movd        dword ptr [edi],xmm0        ; Store as 32bit ARGB format
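
For readers who prefer intrinsics, the same unpack / convert / pack sequence can be sketched in C with SSE2 intrinsics, each line annotated with the instruction it stands for. This is only an illustration: the function name `scale_argb_pixel` and the `gain` multiplication are placeholders for "do all floating point calculations here" and are not part of the original routine.

```c
#include <emmintrin.h>
#include <stdint.h>

/* Hypothetical example: unpack one ARGB pixel to 4 floats, multiply
   every channel by gain (placeholder math), and pack it back. */
static uint32_t scale_argb_pixel(uint32_t argb, float gain)
{
    const __m128i zero = _mm_setzero_si128();     /* pxor      */
    __m128i px = _mm_cvtsi32_si128((int)argb);    /* movd      */
    px = _mm_unpacklo_epi8(px, zero);             /* punpcklbw */
    px = _mm_unpacklo_epi16(px, zero);            /* punpcklwd */
    __m128 f = _mm_cvtepi32_ps(px);               /* cvtdq2ps  */

    f = _mm_mul_ps(f, _mm_set1_ps(gain));         /* placeholder for the real math */

    px = _mm_cvtps_epi32(f);                      /* cvtps2dq  */
    px = _mm_packs_epi32(px, px);                 /* packssdw  */
    px = _mm_packus_epi16(px, px);                /* packuswb: saturates to 0..255 */
    return (uint32_t)_mm_cvtsi128_si32(px);       /* movd      */
}
```

Because packuswb saturates, a channel that overflows (for example alpha 255 multiplied by 2) is clamped to 255 rather than wrapping around.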

Siekmanski · February 26, 2019, 07:35:39 AM

What Magnus proposed is also very handy.
A fixedpoint ( 16:16 or whatever range or precision you need ) index pointer.
I have used this very often in sound programming for changing frequencies or for timing calculations by counting fractions of the sound samples.
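
A minimal sketch of what a 16:16 fixed-point index could look like, written in C for brevity. The function name `lut_fixed_point` and the choice to use the fraction for linear interpolation between two table entries are assumptions made for illustration; the fraction could just as well only be accumulated for stepping through the table.

```c
#include <stdint.h>

/* Hypothetical sketch: pos is a 16:16 fixed-point index
   (upper 16 bits = table index, lower 16 bits = fraction).
   The caller must guarantee that index + 1 is still inside the table. */
static float lut_fixed_point(const float *lut, uint32_t pos)
{
    uint32_t index = pos >> 16;                       /* integer part    */
    float frac = (float)(pos & 0xFFFFu) / 65536.0f;   /* fractional part */
    return lut[index] + (lut[index + 1] - lut[index]) * frac;
}
```

Stepping through the table is then a plain integer add on `pos`, e.g. adding 8000h advances by half an entry.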

guga · February 26, 2019, 09:21:36 AM

Hi marinus,

I'm not sure I understand this. I haven't used fixed point as an index yet.

About SSE, we can use it. However, each byte is used as an index, so how do I load them with SSE so that they point to the proper location?

For example:
Pixel1 = 0, 175, 15, 20
Pixel2 = 0, 35, 11, 150
Pixel3 = 0, 98, 18, 14
Pixel4 = 0, 5, 15, 10

Even if I load all 4 pixels at once, I would still need to take each byte separately, convert it to a dword and point it to the proper location in the table, and then do the math on each value fetched through the index. I realize this is faster, but I've never done it like that with SSE before. Also, there's a problem when the image/video uses a pitch to point to the proper address (as in VirtualDub, for example): how to overcome this? And what if the image width or height is not a multiple of 4?

If you and Magnus could provide a working example, it would be easier to visualize all of this and try to make it work that way. With examples of fixed point used like that, and of retrieving the values from a table with SSE after converting from bytes, it would be easier for me to understand how this can be done.

I'm trying to optimize the functions using pointers to tables, but I'm clueless about how to make it faster using SSE or fixed point.

Also, for the output, what we will see for chroma, for example: wouldn't it be better if it always went from 0 to 100 or 0 to 255, considering that Chroma is also a range linked to the range of luma?

For example, the limits I've got so far are in the form of a table, each one of them in its own range.
Example:

Quote
Gray/Index   Luma       Chroma Min   Chroma Max   Hue Min    Hue Max
100          65.54654   35.456       85.659       118.65º    175.656º
...
255          100        253.79       253.85       180º       180º

and so on

Those values are precalculated, and if the user inputs, for example, Red = 100, it will look up the limits in the table above, simply pointing to offset "100" and taking the values from there in order to check the boundaries.

For RGBtoCieLCH, it will do the math to calculate Chroma, Hue and Luma (after using red, green and blue also as pointers into other tables, like the Kfactor map that contains all the precalculated gamma values to be multiplied by the tristimulus matrix).

For CieLCHtoRGB, it will then use the above table to check the limits. For example, if you input Chroma = 150.11254 and Index (Luma/Gray) = 100, it will first check whether the chroma you entered is between the Min and Max Chroma. If it is, it will continue and do the math to convert it back to RGB. If not, it will try to fix the input values. The fix can be made by setting Chroma to the maximum value, since there's no chroma of 150.11 in this range and the nearest valid value is 85.659.

That's why I thought of using integers, because we can use the integer values 0 to 255 as a percentage. This way you can't input incorrect values anyway, because if you use Chroma = 128 as input (for example), it will calculate the chroma as half of the range between the min and max, (85.659-35.456)/2 above the minimum, and thus it will always stay within the limits of each range in the table.
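
One possible way to write that 0 to 255 mapping, sketched in C. The function name `chroma_from_index` and the linear mapping are assumptions for illustration; the Min/Max values would come from the row of the limits table selected by the Luma index.

```c
#include <stdint.h>

/* Hypothetical helper: map an 8-bit chroma index onto the
   [chroma_min, chroma_max] interval of the current luma range. */
static double chroma_from_index(uint8_t index, double chroma_min, double chroma_max)
{
    return chroma_min + (chroma_max - chroma_min) * ((double)index / 255.0);
}
```

With the row for index 100 (Min = 35.456, Max = 85.659), an input of 0 yields 35.456, 255 yields 85.659, and 128 lands near the middle at about 60.66, so no input can fall outside the limits.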

Siekmanski · February 26, 2019, 11:26:01 AM

It all depends on your style of coding.
There are many many ways to get the same result.

The only advantage of SIMD is, you can do multiple calculations at once.
In the case of color conversions, SIMD is ideal.
You can calculate all 3 RGB values in parallel.
And with some horizontal and vertical shuffles you can do cool things too.

Using a real4 value as an index pointer:

- Convert the 4 real4 values to 4 dwords ( cvtps2dq xmm0,xmm0 )
- Shuffle the dword you need to position 0 ( shufps xmm0,xmm0,Shuffle(0,3,2,1) )
- Copy the dword to a register ( movd eax,xmm0 )
- Use eax as the index pointer for the LUT.
- Shuffle the next dword in position etc.

This way you don't need fixedpoint integers to get a value from a LUT.
And your LUT can be any size if you use a factor which gives you the correct position in the LUT.
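
The steps above can be sketched in C with SSE2 intrinsics as follows. The function name `lut_gather4`, the `factor` parameter and the index clamping are assumptions added for illustration. The sketch rotates with pshufd (`_mm_shuffle_epi32`) instead of shufps, since the values are already integers at that point; `_MM_SHUFFLE` uses the same bit layout as the Shuffle macro.

```c
#include <emmintrin.h>

/* Hypothetical sketch: scale 4 floats by factor, convert them to dword
   indices and fetch one table entry per lane. */
static void lut_gather4(const float *lut, int lut_size, __m128 v,
                        float factor, float out[4])
{
    /* mulps + cvtps2dq: 4 real4 values -> 4 dword indices */
    __m128i idx = _mm_cvtps_epi32(_mm_mul_ps(v, _mm_set1_ps(factor)));

    for (int lane = 0; lane < 4; ++lane) {
        int i = _mm_cvtsi128_si32(idx);            /* movd eax,xmm0 */
        if (i < 0) i = 0;                          /* keep the index inside the table */
        if (i >= lut_size) i = lut_size - 1;
        out[lane] = lut[i];
        /* rotate the next dword into position 0 */
        idx = _mm_shuffle_epi32(idx, _MM_SHUFFLE(0, 3, 2, 1));
    }
}
```

The table lookups themselves stay scalar here; SSE2 has no gather instruction, so only the scaling and conversion run four lanes at a time.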

SSE2 also has min and max instructions, useful for clamping ranges, as in your algorithm for chroma and hue.
You can clamp both of them at once with 2 instructions.

Minimum_mask    real4 0.0,0.0,35.456,118.65
Maximum_mask   real4 0.0,0.0,85.659,175.656

maxps   xmm0,Minimum_mask
minps   xmm0,Maximum_mask
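
The same clamp, sketched in C with intrinsics. The function name `clamp_lanes` is hypothetical. Two details are worth keeping in mind: lanes where both masks hold 0.0 are forced to 0.0 (if those lanes should pass through unchanged, the masks would need very small and very large values there instead), and in assembly a memory operand of maxps/minps has to be 16-byte aligned.

```c
#include <emmintrin.h>

/* Hypothetical helper: clamp every lane of v to [lo, hi]. */
static __m128 clamp_lanes(__m128 v, __m128 lo, __m128 hi)
{
    v = _mm_max_ps(v, lo);       /* maxps xmm0,Minimum_mask */
    return _mm_min_ps(v, hi);    /* minps xmm0,Maximum_mask */
}
```

With the masks above, an input chroma of 150.11254 comes back as 85.659 and an input hue of 90.0 comes back as 118.65.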

guga · February 26, 2019, 12:34:43 PM

Amazing !

I really will have to start trying to port the code to SSE. I rarely use those instructions, and they are still a bit hard for me to follow.

About shufps, what is the macro "Shuffle(0,3,2,1)" ?
Can i write it as:

Quote
[SHUFFLE | (#4 or (#3 shl 2) or (#2 shl 4) or (#1 shl 6) ) ] ; used as a paramacro.

shufps xmm0 xmm0 {SHUFFLE 0,3,2,1}

Is this correct? And what if I want to use the positions as {SHUFFLE 3,2,0,1}? The value will always be placed at position "0" and the other values will exchange their positions, right?

I'll start adjusting the code and fix the backward functions. Once I finish, I'll give porting all of it to SSE2 a try. I'm just worried about making it portable to other cases, where we have RGBQUAD rather than ARGB etc. Using this shufps may also solve this and keep portability.

SSE macros may also be very handy. Do you have any that can make comparisons, like the one we created for the FPU in RosAsm?

Code Select

Fpu_If R$MyVar = R$Zero
; do this:
Fpu_End_If

Which unrolled produces:

    FLD R$Zero
    FLD R$MyVar
    FCOMPP
    FSTSW AX
    FWAIT
    SAHF
    JNE L0>>

; Do this

L0:

Or...

Code Select

Fpu_If R$MyVar > R$Zero
; do this:
Fpu_End_If

Which unrolled produces:

    FLD R$Zero
    FLD R$MyVar
    FCOMPP
    FSTSW AX
    FWAIT
    SAHF
    jna L0>>

; Do this

L0:

How to make similar ones (for comparisons) using SSE 2 ?

jj2007 · February 26, 2019, 03:42:36 PM

Quote from: guga on February 26, 2019, 12:34:43 PM
Do you have any that can make comparisons ?

include \masm32\MasmBasic\MasmBasic.inc ; download
Init
SetFloat xmm0=1234567890.12345
SetFloat xmm1=1234567890.12346
PrintLine "default precision:"
Fcmp xmm0, xmm1
.if Zero?
PrintLine Str$(f:xmm0), "==", Str$(f:xmm1)
.elseif FcmpLess
PrintLine Str$(f:xmm0), Chr$(60), Str$(f:xmm1)
.else
PrintLine Str$(f:xmm0), Chr$(62), Str$(f:xmm1)
.endif
PrintLine CrLf$, "top precision:"
Fcmp xmm0, xmm1, top
.if Zero?
PrintLine Str$(f:xmm0), "==", Str$(f:xmm1)
.elseif FcmpLess
PrintLine Str$(f:xmm0), Chr$(60), Str$(f:xmm1)
.else
PrintLine Str$(f:xmm0), Chr$(62), Str$(f:xmm1)
.endif
Inkey CrLf$, "comparing floats is fun!!!!"
EndOfCode

Code Select

default precision:
1234567890.12345==1234567890.12346

top precision:
1234567890.12345<1234567890.12346

comparing floats is fun!

Attached a project comparing directly xmm0 and ST(0).

Re shuffling, see SwapBytes

Siekmanski · February 26, 2019, 10:06:09 PM

Quote from: guga on February 26, 2019, 12:34:43 PM
Amazing !

I really will have to start trying to port the code to SSE. I rarely rarely use those instructions and it is a bit hard for me to follow yet.

About shufps, what is the macro "Shuffle(0,3,2,1)" ?
Can i write it as:
Quote
[SHUFFLE | (#4 or (#3 shl 2) or (#2 shl 4) or (#1 shl 6) ) ] ; used as a paramacro.

shufps xmm0 xmm0 {SHUFFLE 0,3,2,1}

Is this correct ? And if i want to use the position as {SHUFFLE 3,2,0,1} ? The value will always be positioned at "0" and the other values will be exchanging their pos, right ?

No, I used the shufps instruction for rotating the 4 positions within xmm0 to extract each dword from pos 0 - 3

Code Select

    movd    eax,xmm0                    ; get dword from position 0 = 0     [3 2 1 0]
    shufps  xmm0,xmm0,Shuffle(0,3,2,1)  ; positions   [0 3 2 1]
    movd    eax,xmm0                    ; get dword from position 0 = 1        
    shufps  xmm0,xmm0,Shuffle(0,3,2,1)  ; positions   [1 0 3 2]
    movd    eax,xmm0                    ; get dword from position 0 = 2
    shufps  xmm0,xmm0,Shuffle(0,3,2,1)  ; positions   [2 1 0 3]
    movd    eax,xmm0                    ; get dword from position 0 = 3

If you want for example, position 2 directly:

Code Select

    shufps  xmm0,xmm0,Shuffle(3,0,1,2)  ; positions 2 and 0 are swapped
    movd    eax,xmm0                    ; get dword from position 0

Keep in mind that the positions stay in the shuffled state.
So you need to keep track of the changed positions within the xmm register after each shuffle.

If you don't like this, you can use pshufd and shuffle the value you need directly to another xmm register.

Code Select

    movd    eax,xmm0
    pshufd  xmm1,xmm0,Shuffle(0,0,0,1)
    movd    eax,xmm1
    pshufd  xmm1,xmm0,Shuffle(0,0,0,2)
    movd    eax,xmm1
    pshufd  xmm1,xmm0,Shuffle(0,0,0,3)
    movd    eax,xmm1

The shuffle macro I use:

Shuffle MACRO V0,V1,V2,V3
EXITM %((V0 shl 6) or (V1 shl 4) or (V2 shl 2) or (V3))
ENDM
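
To make the encoding concrete, here is a small C sketch that computes the same immediate as the Shuffle macro and applies the rotate shown earlier. The function names are hypothetical. The RosAsm paramacro asked about above, `(#4 or (#3 shl 2) or (#2 shl 4) or (#1 shl 6))`, should produce the same byte, assuming #1 to #4 are the first to fourth macro parameters.

```c
#include <emmintrin.h>

/* Same bit layout as the Shuffle macro:
   (V0 shl 6) or (V1 shl 4) or (V2 shl 2) or (V3) */
static int shuffle_imm(int v0, int v1, int v2, int v3)
{
    return (v0 << 6) | (v1 << 4) | (v2 << 2) | v3;
}

/* Rotate the four dwords one position down, as Shuffle(0,3,2,1) does,
   and return the dword that ends up in position 0. */
static int lane0_after_rotate(__m128i v)
{
    v = _mm_shuffle_epi32(v, _MM_SHUFFLE(0, 3, 2, 1));   /* pshufd, immediate 39h */
    return _mm_cvtsi128_si32(v);                          /* movd eax,xmm0 */
}
```

`shuffle_imm(0,3,2,1)` is 39h and `shuffle_imm(3,0,1,2)` is 0C6h; each 2-bit field names the source position that is copied into the corresponding destination position, with the rightmost argument controlling position 0.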

Commonly used compare instructions:

comiss comisd
ucomiss ucomisd
cmpps cmpss cmppd cmpsd
pcmpeqb pcmpeqw pcmpeqd
pcmpgtb pcmpgtw pcmpgtd , and many more....

Code Select

Relation                     Instruction
Equal                        CMPEQSS
Equal                        CMPEQPS
Less Than                    CMPLTSS
Less Than                    CMPLTPS
Less Than or Equal           CMPLESS
Less Than or Equal           CMPLEPS
Greater Than                 CMPLTSS (operands swapped)
Greater Than                 CMPLTPS (operands swapped)
Greater Than or Equal        CMPLESS (operands swapped)
Greater Than or Equal        CMPLEPS (operands swapped)
Not Equal                    CMPNEQSS
Not Equal                    CMPNEQPS
Not Less Than                CMPNLTSS
Not Less Than                CMPNLTPS
Not Less Than or Equal       CMPNLESS
Not Less Than or Equal       CMPNLEPS
Not Greater Than             CMPNLTSS (operands swapped)
Not Greater Than             CMPNLTPS (operands swapped)
Not Greater Than or Equal    CMPNLESS (operands swapped)
Not Greater Than or Equal    CMPNLEPS (operands swapped)
Ordered                      CMPORDSS
Ordered                      CMPORDPS
Unordered                    CMPUNORDSS
Unordered                    CMPUNORDPS
Equal                        COMISS
Less Than                    COMISS
Less Than or Equal           COMISS
Greater Than                 COMISS
Greater Than or Equal        COMISS
Not Equal                    COMISS
Equal                        UCOMISS
Less Than                    UCOMISS
Less Than or Equal           UCOMISS
Greater Than                 UCOMISS
Greater Than or Equal        UCOMISS
Not Equal                    UCOMISS

some need some extra explanation:

Code Select


Pseudo-Op            CMPPS Implementation

CMPEQPS xmm1, xmm2    CMPPS xmm1, xmm2, 0 
CMPLTPS xmm1, xmm2    CMPPS xmm1, xmm2, 1 
CMPLEPS xmm1, xmm2    CMPPS xmm1, xmm2, 2 
CMPUNORDPS xmm1, xmm2 CMPPS xmm1, xmm2, 3 
CMPNEQPS xmm1, xmm2   CMPPS xmm1, xmm2, 4 
CMPNLTPS xmm1, xmm2   CMPPS xmm1, xmm2, 5 
CMPNLEPS xmm1, xmm2   CMPPS xmm1, xmm2, 6 
CMPORDPS xmm1, xmm2   CMPPS xmm1, xmm2, 7
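
As a small illustration of how one of these predicates is typically consumed, the following C sketch performs a packed "less than" (predicate 1) and turns the resulting lane masks into a 4-bit integer with movmskps. The function name `less_than_mask4` is hypothetical.

```c
#include <emmintrin.h>

/* Bit n of the result is set when a[n] < b[n]. */
static int less_than_mask4(__m128 a, __m128 b)
{
    __m128 m = _mm_cmplt_ps(a, b);   /* cmpltps = cmpps xmm1,xmm2,1 */
    return _mm_movemask_ps(m);       /* movmskps: one sign bit per lane */
}
```

Comparing (1, 5, 3, 9) against (2, 4, 3, 10) gives 1001 binary: only lanes 0 and 3 satisfy "less than".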

Just look them up in the Intel Software Developer's Manuals:

https://software.intel.com/en-us/articles/intel-sdm

The next document contains the full instruction set reference, A-Z, in one volume.
Describes the format of the instruction and provides reference pages for instructions.
This document allows for easy navigation of the instruction set reference through functional cross-volume table of contents, references, and index:

https://software.intel.com/sites/default/files/managed/a4/60/325383-sdm-vol-2abcd.pdf

guga · February 27, 2019, 01:37:12 AM

Tks, Marinus and JJ

I'll take a look at those and try to learn SSE coding. It is really interesting.

JJ, your Fcmp macro is actually a function, right? movups loads 4 Real4 at once, is that it?

Marinus, I'll start reading the manual this afternoon. In the meantime I found this in another manual (https://fizyka.umk.pl/~daras/mtm/26_IA32-1-sse.pdf)

Quote
11.6.12. Branching on Arithmetic Operations

There are no condition codes in SSE or SSE2 states. A packed-data comparison instruction generates a mask which can then be transferred to an integer register. The following code sequence provides an example of how to perform a conditional branch, based on the result of an SSE2 arithmetic operation.

cmppd XMM0, XMM1; generates a mask in XMM0
movmskpd EAX, XMM0; moves a 2 bit mask to eax
test EAX, 0,2 ; compare with desired result
jne BRANCH TARGET

The COMISD and UCOMISD instructions update the EFLAGS as the result of a scalar comparison. A conditional branch can then be scheduled immediately following COMISD/UCOMISD.

I gave a try using a Real8 to load the data and came up with this:

Code Select


[SSEData2: R$ 17.656]
[SSEData3: R$ 25.656]

    cvtsd2ss xmm0 X$SSEData2
    cvtsd2ss xmm1 X$SSEData3
    cmppd xmm0 xmm1 1
    movmskpd eax xmm0
    Test_If eax 1 ; "Test_if is a RosAsm macro. Unrolled is the same as "test eax 1 | jne L2> ; do this L2:"
        ; do this
    Test_End

The comparison was OK.

Then I tried a variation of this, converting the Real4 to Real8 like this:

Code Select


[SSEData2b: F$ 17.656] ; Real4
[SSEData3b: F$ 25.656] ; Real4

     cvtss2sd xmm0 X$SSEData2b
    cvtss2sd xmm1 X$SSEData3b
    cmppd xmm0 xmm1 1
    movmskpd eax xmm0
    Test_If eax 1
;         mov eax eax
    Test_End

That was also OK. I tried converting from Real8 to Real4 using cvtpd2ps and it was OK, too.

But when I tried to do both in sequence, an error occurred:

Code Select


    ; A variant using movq to load the data.
    movq xmm0 X$SSEData2
    movq xmm1 X$SSEData3
    cmppd xmm0 xmm1 1
    movmskpd eax xmm0
    Test_If eax 1 ; Comparison was OK, it goes to the next line
;         do this
    Test_End

    cvtsd2ss xmm0 X$SSEData2
    cvtsd2ss xmm1 X$SSEData3
    cmppd xmm0 xmm1 1
    movmskpd eax xmm0
    Test_If eax 1 ; here the error happened. It jumped over the test !
         ; do this.
    Test_End

The error happened because the 1st cmppd left its all-ones result mask in the low quadword of xmm0, which reads as a QNAN ("0 QNAN" when viewing the packed doubles in the RosAsm debugger).

Since cvtsd2ss only writes the low 32 bits, the upper half of that quadword keeps the mask bits, so when I compared again the value in xmm0 was still a QNAN, causing eax to return 0 rather than 1, right?

So the solution is to always clear xmm0 and xmm1 before loading values into those registers, as:

Code Select


    xorps xmm0 xmm0 ; clear registers to avoid errors
    xorps xmm1 xmm1

    movq xmm0 X$SSEData2
    movq xmm1 X$SSEData3
    cmppd xmm0 xmm1 1
    movmskpd eax xmm0
    Test_If eax 1
;         Do this
    Test_End

    xorps xmm0 xmm0 ; clear registers to avoid errors
    xorps xmm1 xmm1

    cvtsd2ss xmm0 X$SSEData2
    cvtsd2ss xmm1 X$SSEData3
    cmppd xmm0 xmm1 1
    movmskpd eax xmm0
    Test_If eax 1
;         Do this
    Test_End

But is there an equivalent to fcompp in SSE2, to avoid the 2 extra opcodes that clear xmm0 and xmm1 before they are used?

I'm asking this in order to create some easier-to-read macros using these SSE2 opcodes for conditional branching. JJ showed how to use comisd for comparisons similar to these.
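
The comisd route mentioned here is probably the closest match to the FPU compare sequence: comisd/ucomisd look only at the low double of each register and set ZF, PF and CF directly, so the usual je/jb/ja-style jumps can follow without fstsw/sahf, and leftover bits in the rest of the register do not matter. A minimal C sketch, with the hypothetical name `real8_less_than`:

```c
#include <emmintrin.h>

/* Scalar Real8 compare: returns 1 when *a < *b, otherwise 0. */
static int real8_less_than(const double *a, const double *b)
{
    __m128d x = _mm_load_sd(a);    /* movsd: loads the low lane, zeroes the high lane */
    __m128d y = _mm_load_sd(b);
    return _mm_comilt_sd(x, y);    /* comisd, then a test of the flags */
}
```

The important part is that the load and the compare agree on the data type: a Real8 loaded with movsd and compared with comisd (or cmpsd), never a cvtsd2ss result fed into cmppd.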

But... can SSE2 compare 2 or 4 values in xmm0 (low and high quadword) at the same time?

I tried this, but didn't understand the results in eax:

Code Select

[SSEData2a: R$ 8, 7]
[SSEData3a: R$ 2, 9]

    xorps xmm0 xmm0 ; clear registers to avoid errors
    xorps xmm1 xmm1

    cvtsd2ss xmm0 X$SSEData2
    cvtsd2ss xmm1 X$SSEData3
    cmppd xmm0 xmm1 1
    movmskpd eax xmm0
    Test_If eax 1
;         mov eax eax
    Test_End

It is comparing 8 with 2 and 7 with 9 at the same time, right? So what should the expected values in eax be?

I mean, this way I could make the routines do 2 things at the same time according to the result in eax.
Ex:
if eax = 00_01 (binary) ; It means that 8 < 2
; do this
If eax = 00_10 ; it means that 7 < 9
; do that
If eax = 00_11 ; both are smaller
; do this and that

Is there a way to do those conditional branches at once?
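
To see which value eax would hold for this pair of comparisons, here is a minimal C sketch that loads both Real8 values of each operand (the equivalent of movupd rather than cvtsd2ss), compares them with "less than" and extracts the 2-bit mask. The function name `less_than_mask2` is hypothetical.

```c
#include <emmintrin.h>

/* Bit 0 is set when a[0] < b[0], bit 1 when a[1] < b[1]. */
static int less_than_mask2(const double a[2], const double b[2])
{
    __m128d x = _mm_loadu_pd(a);                    /* movupd: both doubles */
    __m128d y = _mm_loadu_pd(b);
    return _mm_movemask_pd(_mm_cmplt_pd(x, y));     /* cmpltpd + movmskpd */
}
```

For {8, 7} against {2, 9} the result is 10 binary (decimal 2): 8 < 2 is false, 7 < 9 is true. The mask can then be used directly as an index into a 4-entry jump table, or in a switch.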

Siekmanski · February 27, 2019, 02:56:21 AM

For 2 real8 values you have 2 sign bits in eax, which gives you 4 possibilities ( movmskpd )
For 4 real4 values you have 4 sign bits in eax, which gives you 16 possibilities ( movmskps )

This also gives you the possibility to use a jump table, executing the right code without a chain of conditional branches.
For RGB conversion you will be able to handle 3 sign bits ( 8 possibilities ).

So 1 compare makes it possible to select the code to execute at once.

For example, in this piece of pseudo code (you can see I'm learning xyz2lab) you can compare red, green and blue at once and jump to 1 of the 8 possibilities.

   XR = 95.047
   YG = 100.000
   ZB = 108.883

   x = x / XR
   y = y / YG
   z = z / ZB

   if (x > 0.008856) x = pow(x, 1.0 / 3.0)
   else x = (x * 7.787) + (16.0 / 116.0)

   if (y > 0.008856) y = pow(y, 1.0 / 3.0)
   else y = (y * 7.787) + (16.0 / 116.0)

   if (z > 0.008856) z = pow(z, 1.0 / 3.0)
   else z = (z * 7.787) + (16.0 / 116.0)
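
A rough C sketch of how the three threshold tests above could be folded into a single compare. Everything here is illustrative: the names `xyz_to_f` and `cube_root` are hypothetical, the input is assumed to be already divided by XR, YG and ZB, and `cube_root` is a simple Newton iteration standing in for pow(t, 1.0 / 3.0), meant only for the small positive values that occur here.

```c
#include <emmintrin.h>

/* Stand-in for pow(t, 1.0 / 3.0): Newton iteration, valid for t > 0. */
static float cube_root(float t)
{
    float r = 1.0f;
    for (int i = 0; i < 40; ++i)
        r = (2.0f * r + t / (r * r)) / 3.0f;
    return r;
}

/* xyz holds x/XR, y/YG, z/ZB. Returns the 3-bit mask of the lanes that
   are above the threshold; f receives the transformed values. */
static int xyz_to_f(const float xyz[3], float f[3])
{
    __m128 v = _mm_setr_ps(xyz[0], xyz[1], xyz[2], 0.0f);

    /* one compare for all three channels: cmpps + movmskps */
    int mask = _mm_movemask_ps(_mm_cmpgt_ps(v, _mm_set1_ps(0.008856f))) & 7;

    /* the linear branch is cheap, so compute it for every lane at once */
    float linear[4];
    _mm_storeu_ps(linear, _mm_add_ps(_mm_mul_ps(v, _mm_set1_ps(7.787f)),
                                     _mm_set1_ps(16.0f / 116.0f)));

    for (int i = 0; i < 3; ++i)
        f[i] = ((mask >> i) & 1) ? cube_root(xyz[i]) : linear[i];
    return mask;
}
```

The returned mask (0 to 7) is the value that would select one of the 8 entries of a jump table; the loop here just tests its bits one by one to keep the sketch short.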

The MASM Forum
