Equivalence angle conversion in SSE2

guga · July 22, 2023, 10:02:01 AM

It seems harder than I thought. Using roundsd would be easier, but SSE2 doesn't have any similar opcode. Can it be emulated?

I gave it a test on, arghhh, Bard, and it showed me code that could emulate roundsd, but made for Visual Studio.

Code Select

int round_pd_sse2(double x) {
  // Isolate the fractional part of the floating-point value.
  double f = x - (int)x;

  // Round the fractional part to the nearest integer.
  int i = (int)f + 0.5;

  // If the fractional part is exactly 0.5, round to the nearest even integer.
  if (f == 0.5) {
    i = (i % 2 == 0) ? i : i - 1;
  }

  return i;
}

I doubt it would work, but can someone test or port it? The corresponding disassembled code (according to godbolt) is:

Code Select

__real@3fe0000000000000 DQ 03fe0000000000000r   ; 0.5

_f$ = -16                                         ; size = 8
tv75 = -8                                         ; size = 4
_i$ = -4                                                ; size = 4
_x$ = 8                                       ; size = 8
int round_pd_sse2(double) PROC                   ; round_pd_sse2
        push    ebp
        mov     ebp, esp
        sub     esp, 16                             ; 00000010H
        cvttsd2si eax, QWORD PTR _x$[ebp]
        cvtsi2sd xmm0, eax
        movsd   xmm1, QWORD PTR _x$[ebp]
        subsd   xmm1, xmm0
        movsd   QWORD PTR _f$[ebp], xmm1
        cvttsd2si ecx, QWORD PTR _f$[ebp]
        cvtsi2sd xmm0, ecx
        addsd   xmm0, QWORD PTR __real@3fe0000000000000
        cvttsd2si edx, xmm0
        mov     DWORD PTR _i$[ebp], edx
        movsd   xmm0, QWORD PTR _f$[ebp]
        ucomisd xmm0, QWORD PTR __real@3fe0000000000000
        lahf
        test    ah, 68                              ; 00000044H
        jp      SHORT $LN2@round_pd_s
        mov     eax, DWORD PTR _i$[ebp]
        and     eax, -2147483647              ; 80000001H
        jns     SHORT $LN6@round_pd_s
        dec     eax
        or      eax, -2                           ; fffffffeH
        inc     eax
$LN6@round_pd_s:
        test    eax, eax
        jne     SHORT $LN4@round_pd_s
        mov     ecx, DWORD PTR _i$[ebp]
        mov     DWORD PTR tv75[ebp], ecx
        jmp     SHORT $LN5@round_pd_s
$LN4@round_pd_s:
        mov     edx, DWORD PTR _i$[ebp]
        sub     edx, 1
        mov     DWORD PTR tv75[ebp], edx
$LN5@round_pd_s:
        mov     eax, DWORD PTR tv75[ebp]
        mov     DWORD PTR _i$[ebp], eax
$LN2@round_pd_s:
        mov     eax, DWORD PTR _i$[ebp]
        mov     esp, ebp
        pop     ebp
        ret     0
int round_pd_sse2(double) ENDP                   ; round_pd_sse2

Is this emulation correct? Is it possible to recreate the same functionality of roundsd using SSE2 instructions?
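For what it's worth, the C routine above cannot work as posted: f always lies between -1 and 1, so (int)f is 0 and the function returns 0 for every input that fits in an int. Below is a minimal sketch of one common SSE2-only alternative, written in C with intrinsics. It is not the Bard code and not roundsd itself, only an emulation of its round-to-nearest-even mode. It assumes the default MXCSR rounding mode, and round_sd_sse2 is a hypothetical name. The trick is that adding and then subtracting 2^52 pushes the fraction bits out of the 53-bit mantissa, so the hardware does the rounding.

```c
#include <emmintrin.h>   /* SSE2 intrinsics */

/* Hypothetical helper: round to nearest, ties to even, using SSE2 only. */
static double round_sd_sse2(double x)
{
    const __m128d sign_mask = _mm_set_sd(-0.0);                /* only the sign bit set */
    const __m128d magic     = _mm_set_sd(4503599627370496.0);  /* 2^52 */

    __m128d v    = _mm_set_sd(x);
    __m128d sign = _mm_and_pd(v, sign_mask);                   /* andpd: keep the sign */
    __m128d a    = _mm_andnot_pd(sign_mask, v);                /* andnpd: |x| */

    /* addsd + subsd: the fraction is rounded away in the default rounding mode */
    __m128d r = _mm_sub_sd(_mm_add_sd(a, magic), magic);

    /* cmpltsd: use r only when |x| < 2^52; larger values, infinities and NaNs
       have no fraction to remove and are passed through unchanged */
    __m128d is_small = _mm_cmplt_sd(a, magic);
    r = _mm_or_pd(_mm_and_pd(is_small, r), _mm_andnot_pd(is_small, a));

    return _mm_cvtsd_f64(_mm_or_pd(r, sign));                  /* orpd: restore the sign */
}
```

Each intrinsic maps to a single SSE2 instruction, so porting this to assembly should be mechanical. For example, round_sd_sse2(2.5) gives 2.0 and round_sd_sse2(3.5) gives 4.0, as roundsd does in its nearest-even mode.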

daydreamer · July 22, 2023, 01:48:19 PM

Guga
With SSE, for all the kinds of math instructions it lacks, you have to write the code yourself.
Advanced way: you can write a roundsd macro, and in MASM use OPTION NOKEYWORD:<roundsd> so that the roundsd mnemonic can be redefined as a macro.
.data
Round real8 0.5,0.5
.code
Addpd xmm0,round

guga · July 22, 2023, 03:06:51 PM

Hi Daydreamer

Ok, but will it work for huge values when trying to create an equivalent angle for weird angles such as 4.845978e11, 4.567972e45 etc.?

I'm trying to see if there is a way to avoid the normal limitations and make it possible to retrieve whatever fractional part results from a number after it is divided by 360, for example. So, no matter if the number is 1.235468e45, whenever it is divided by another value, it will always (or should, at least) result in an integer and a fractional part (which is the one I'm trying to retrieve to calculate the equivalent angle).

I know it is really, really unusual for someone to use such values for calculating tangent, arctan, cos etc., but it could be good to prevent strange situations when using such functions in color conversion routines.

I'm rebuilding an old routine I made to convert RGB to CIELCh that, in fact, has almost no limitations, but it uses tan and atan functions. I succeeded in porting both to SSE, but in very rare situations some calculations may end up on those weird numbers. To avoid using limits inside the function, such as jumping if above or below some bound, I'm trying to force the tangent function (or other trigonometric functions) to handle those situations more properly.

The old conversion routine I made for RGB to CIELCh has an error in design, because I ended up creating a function that handles the luminance, hue and chroma as packets limited within the same levels of gray.

The result was not bad, because it ends up categorizing pixels by a relation between their hues and luma, but it killed the initial functionality of the conversion itself.

The problem of the converter (I suppose) is that even if I can attach a luma level (gray) to a certain hue level, the same thing seems not to work for chroma, which results in an image that is too blue, or green, or pink etc. if I change the variables I'm using. Perhaps, if I remove the limitation on chroma, the algorithm would work as expected instead of blueing the image whenever I decrease the luma.

Making better tan/atan routines can not only speed up the code a lot, but also help me fix those problems and find out what exactly is going on with the LCh relations that is causing the algo to fail.

TimoVJL · July 22, 2023, 10:11:50 PM

Is x64 at a dead end, as most of the accurate calculations move to the GPU?

jj2007 · July 22, 2023, 10:32:02 PM

Quote from: jj2007 on July 21, 2023, 07:26:14 AM
Generally speaking, the Windows x64 ABI does allow the use of the FPU. From what I saw in The Laboratory in the past, the FPU is a) much more precise but b) not slower than SIMD code. The only convincing reason to not use it is if you have lots of data that you can process in parallel. That is the only area where SIMD shines.

Quote from: TimoVJL on July 22, 2023, 10:11:50 PM
Is x64 at a dead end, as most of the accurate calculations move to the GPU?

So you are adding a third case, the GPU. Do you have a link with an example? I've heard the GPU can do lots of useful things, but "accuracy" was not the prime concern there.

TimoVJL · July 22, 2023, 10:57:08 PM

No, but I read that the science world started using GPUs some years ago, as CPUs can't help them enough.
Just check something about Nvidia and why China is on the blacklist.

guga · July 23, 2023, 11:08:20 AM

Ok, guys

Done with the hardest part. Now I'm trying to find exceptions or cases where weird values have to be handled in other situations. So far, I succeeded in finding equivalent angles of weird values such as:

870.41º = 150.41º
33e17º = 240º
2.45481458182487182147878e304º = 68.1458182487170915º
17e200º = 79.9999999999997157º
365º = 5º
1000º = 280º
900º = 180º
2.1754545e21º = 154.54499999999º

At the moment, this is for positive angles only. I'll review the code and math to check for errors and later try to do the same for negative angles as well.

Once it's done, I'll post here the mathematical equation I made for this weird thing, the full code, and a small DLL so we can benchmark this.

jj2007 · July 23, 2023, 07:37:46 PM

Quote from: guga on July 22, 2023, 10:02:01 AM
It seems harder than I thought. Using roundsd would be easier, but SSE2 doesn't have any similar opcode. Can it be emulated?

Dear Guga,

this is a useless fight. You will have difficulties finding a machine with a cpu that doesn't support SSE4.1, and thus cannot use roundsd.

Now if you find that very, very old machine, just use the fpu. I would do that anyway, because it's perfectly suited to make these calculations. If your client complains that it is too slow (relevant only if parallel processing is possible), tell him to buy a new computer. It will be ten times faster than his old machine.

Remember zedd's problems with my SSE 4.2 Instr() implementation? He had that problem because his machine supports "only" SSE4.1.

A machine that does not support roundsd is over 15 years old (->Penryn, Nehalem).

InfiniteLoop · July 24, 2023, 01:11:34 AM

By "equivalence" do you mean finding the angle X % 2*Pi ?
This has already been done.

Y = X % 2Pi
==>
c = Floor(X/2Pi)
==>
Y = X - c*2Pi_Big - c*2Pi_Small

2Pi_Big = 2*Pi - 2Pi_Small
2Pi_Small = Fractional(2Pi_Infinite_Precision << MANTISSA_BITS) * 2^-MANTISSA_BITS
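The scheme above can be sketched in C roughly as follows. The split constants are borrowed from the pi/2 split used in fdlibm, multiplied by 4; that choice is an assumption and not necessarily the split meant above, and reduce_2pi and floor_via_trunc are hypothetical names. The high part has only 33 significant bits, so c*TWO_PI_HI is exact as long as c fits in about 20 bits. For larger X the product is no longer exact and this two-constant scheme is not enough.

```c
/* 2*pi split into two doubles: HI has 33 significant bits, LO holds the rest.
   Assumption: constants taken from fdlibm's pi/2 split, times 4 (exact). */
static const double TWO_PI_HI = 4.0 * 1.57079632673412561417e+00;
static const double TWO_PI_LO = 4.0 * 6.07710050650619224932e-11;

/* SSE2 has no floor instruction, so build it from truncation
   (cvttsd2si + cvtsi2sd). Only valid while v fits in a 64-bit integer. */
static double floor_via_trunc(double v)
{
    double t = (double)(long long)v;
    return (t > v) ? t - 1.0 : t;
}

/* Hypothetical helper: Y = X - c*2Pi_Big - c*2Pi_Small with c = Floor(X/2Pi). */
static double reduce_2pi(double x)
{
    double c = floor_via_trunc(x / (TWO_PI_HI + TWO_PI_LO));
    return (x - c * TWO_PI_HI) - c * TWO_PI_LO;
}
```

The subtraction is done in two steps on purpose: the first one removes the bulk exactly, the second one applies the small correction that a single double for 2*Pi cannot carry.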

raymond · July 24, 2023, 02:02:38 AM

Quote from: guga on July 23, 2023, 11:08:20 AM
So far, I succeeded in finding equivalent angles of weird values such as:
2.45481458182487182147878e304º = 68.1458182487170915º

I know that I'm probably wasting my time, BUT

Unless you are definitely certain that the exact value of the given 2.45481458182487182147878 would continue with another 282 integer 0's, there is no way you can get the modulo 360º without using huge number procedures.

On the other hand, if any of those additional 282 integer digits could be anything else but 0's, stating that its modulo 360º is any specific value is mathematically WRONG. It could be anything between 0 and 359.

The same reasoning would apply to those other huge numbers.
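To put a number on this, here is a small C sketch (the helper name is made up) that measures the gap between a positive double and the next representable double above it. For 2.45481458182487e304 the gap is 2^959, roughly 4.9e288, so every real number inside that gap is stored as the same REAL8.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: distance from a positive finite double to the next
   representable double above it (one unit in the last place). */
static double gap_to_next_double(double x)
{
    uint64_t bits;
    double next;
    memcpy(&bits, &x, sizeof bits);    /* reinterpret the double as an integer */
    bits += 1;                         /* next bit pattern = next larger double */
    memcpy(&next, &bits, sizeof next);
    return next - x;
}
```

Near 1e10 the gap is about 2e-6 degrees, which is harmless; near 1e304 it is astronomically larger than 360.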

daydreamer · July 24, 2023, 06:35:46 PM

If you convert to 65536 units per circle instead of 360, then and ebx,0fffh is faster than modulo.
And you can use ebx to point into an atan LUT.
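To make the idea concrete, here is a rough C sketch of power-of-two angle units. The names, the 16-bit width and the 4096-entry table size are assumptions for illustration. With 65536 units per full turn, wrapping is a single AND and the top bits can index a table directly. The input must be small enough for the scaled value to fit in a 64-bit integer, so this does not help with the huge values discussed above.

```c
#include <stdint.h>

#define UNITS_PER_TURN 65536.0

/* Hypothetical helper: degrees to 16-bit angle units, wrapped with a mask. */
static uint16_t deg_to_units(double deg)
{
    double scaled = deg * (UNITS_PER_TURN / 360.0);
    /* round half away from zero, then truncate (cvttsd2si in SSE2 terms) */
    long long n = (long long)(scaled + (scaled >= 0.0 ? 0.5 : -0.5));
    return (uint16_t)(n & 0xFFFF);     /* the AND replaces the modulo */
}

/* Index into an assumed 4096-entry lookup table: keep the top 12 bits. */
static unsigned lut_index(uint16_t units)
{
    return units >> 4;
}

/* Back to degrees, for checking. */
static double units_to_deg(uint16_t units)
{
    return units * (360.0 / UNITS_PER_TURN);
}
```

The mask has to match the unit count: 0FFFFh wraps 65536 units, while 0FFFh wraps a 4096-entry table index.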

guga · July 25, 2023, 12:10:57 AM

Quote from: raymond on July 24, 2023, 02:02:38 AM
Quote from: guga on July 23, 2023, 11:08:20 AM
So far, I succeeded in finding equivalent angles of weird values such as:
2.45481458182487182147878e304º = 68.1458182487170915º

I know that I'm probably wasting my time, BUT

Unless you are definitely certain that the exact value of the given 2.45481458182487182147878 would continue with another 282 integer 0's, there is no way you can get the modulo 360º without using huge number procedures.

On the other hand, if any of those additional 282 integer digits could be anything else but 0's, stating that its modulo 360º is any specific value is mathematically WRONG. It could be anything between 0 and 359.

The same reasoning would apply to those other huge numbers.

Hi Raymond

We can get the fractional part of any angle using magic numbers. Dividing a number by 360 is the same as using the magic number 3 or 9, since 360 is divisible by those numbers.

I found an equation that can retrieve the remainder of such divisions, to be used to calculate the equivalent angle. I don't know yet how to make it work with SSE with the same functionality as we have for regular x86 asm, as in here: https://masm32.com/board/index.php?topic=1906.15

The math equation to calculate the remainder of such divisions is described here:
https://codegolf.stackexchange.com/questions/243840/find-the-magic-numbers-to-divide-a-number-without-division?newreg=131c0f9bcfbb4d0ebacf0ef7f4c9c626

Any angle can be divided by a magic number of 3 or 9. All we need to do first is divide the original angle by 40 if we are using a magic number of 9 to get the remainder (since 360/40 = 9), or divide by 120 if we are using a magic number of 3 (360/120 = 3).

So, HugeNumber/magic divider (resulting from the division by either 3 or 9), and then we find the remainder.

Even for huge numbers, no matter what the number is, it will always have a measurable remainder. Of course, we need to take into account the limitations of the Real8 values to be stored, because if we input a value such as 1.234e19, in fact we are inputting only the integer 1234e16, and so on. But even if we consider that after the 14th (or 17th) digit all the other digits are 0, we can still calculate their values, in their truncated form. So, we can do things like 9.45445481112182e16 = 94544548111218.2e3, where the last "2" is the actual remainder. But for bigger values, it doesn't mean we don't have a remainder; it means only that, due to the limitations of 64 bits, all digits above a certain limit will be filled with 0. So, 9.45445481112182e38 = 9454454811121820000000....... Truncating at the 37th digit, all others are naturally 0.

Nevertheless, the same is true of the encoding. When we store a huge number in a 64-bit register or variable, all digits exceeding a limit are discarded. Ex: we can't encode things like 1445454545487512714581781474847847841784178. They will be truncated at a certain digit (17th or 14th etc.).

But none of this means that we cannot calculate equivalent angles of huge numbers (even with their limitations).

For an SSE equivalent of the algorithm, I would need an equivalent of things like shl eax 1 to multiply by 2, shl eax 2 to multiply by 4 and so on, or also shr eax 1 etc., but I can't make it work with psrldq xmm0 1, psrad xmm0 1 etc.

xorpd xmm1 xmm1
cvtpd2dq xmm1 xmm1

; emulate shl eax 1
mov eax 1
CVTSI2SD xmm1 eax ; convert the integer in eax to a double in xmm1
cvtpd2dq xmm1 xmm1 ; convert it back to a 32-bit integer; now we have 00__0000_0000_0000_0000_0000_0000_0000_0001 in the xmm register
PSLLD xmm1 1 ; shift it left by 1; now we have 00__0000_0000_0000_0000_0000_0000_0000_0010 in the xmm register

CVTPS2PD xmm1 xmm1 ; ??? How to convert back from the integer to a double representation ?
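A possible answer to that last question, sketched with C intrinsics so that each line names its instruction: the packed-integer-to-double conversion in SSE2 is cvtdq2pd, whereas cvtps2pd expects single-precision floats as input. The helper name is hypothetical.

```c
#include <emmintrin.h>   /* SSE2 intrinsics */

/* Hypothetical helper: shift a 32-bit integer left inside an xmm register,
   then convert the result to a double, all with SSE2 instructions. */
static double shift_left_then_to_double(int value, int count)
{
    __m128i v = _mm_cvtsi32_si128(value);              /* movd xmm, r32 */
    v = _mm_sll_epi32(v, _mm_cvtsi32_si128(count));    /* pslld xmm, xmm */
    return _mm_cvtsd_f64(_mm_cvtepi32_pd(v));          /* cvtdq2pd xmm, xmm */
}
```

If the value is already a double in an xmm register, there is no need to go through integers at all: mulsd by 2.0, 4.0 or 0.5 only changes the exponent and is exact unless the result overflows or underflows.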

For instance, the magic number of a division by 3 in regular x86 is calculated as:
Magic Number = 31 - log(x)/log(2) , where x= divisor

for 64 bits it is:
Magic Number = 63 - log(x)/log(2) , where x= divisor

The equation is described as:

Given an integer d, find a valid tuple (m, s, f) such that the following formula computes the correct value of q for all x:

y = mx >> 32
q = y + ((x - y) >> 1 & -f) >> s

The formula works by first multiplying x by m and keeping only the upper 32 bits of the 64-bit product, which is the same as computing x*m/2^32 rounded down. The result is y. When f is 1, the formula then subtracts y from x, halves the difference and adds it back to y; this correction stands in for a multiplier that would need 33 bits and so would not fit in a 32-bit register. When f is 0, the mask -f is zero and the correction disappears. Finally, the sum is shifted right by s bits to give q. (For 64 bits, the product is 128 bits wide and the first shift is 64.)

The value of f is either 0 or 1. It is a flag, not a rounding mode: -f is either all zero bits or all one bits, so the AND keeps or drops the correction term. The quotient q is always rounded down.

The value of s is an integer between 0 and 31 (or 0 and 63 for 64 bits). It is the final shift count; together with the first shift it sets the power of two that the multiplier is scaled by.

The value of m is an integer between 2 and 2^32 - 1 (or 2^64-1). It is chosen so that m/2^(32+s) when f = 0, or (2^32+m)/2^(33+s) when f = 1, is just slightly above 1/d.

A valid tuple (m, s, f) for a given value of d can be found by brute-force search: start with m = 2 and s = 0, then increment m and s until a tuple works.
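As a sanity check of the formula, here is a small C sketch for 32-bit unsigned x. The tuples for d = 3, 7 and 9 are the constants compilers commonly emit for these divisors; magic_div and the wrapper names are hypothetical. Note that this is integer-only: it applies once the angle has been converted to an integer, not to a double sitting in an xmm register.

```c
#include <stdint.h>

/* Hypothetical helper: y = mx >> 32; q = (y + (((x - y) >> 1) & -f)) >> s */
static uint32_t magic_div(uint32_t x, uint32_t m, unsigned s, unsigned f)
{
    uint32_t y   = (uint32_t)(((uint64_t)m * x) >> 32);  /* high half of m*x */
    uint32_t fix = ((x - y) >> 1) & (0u - f);            /* correction, only if f = 1 */
    return (y + fix) >> s;
}

static uint32_t div3(uint32_t x) { return magic_div(x, 0xAAAAAAABu, 1, 0); }
static uint32_t div7(uint32_t x) { return magic_div(x, 0x24924925u, 2, 1); }
static uint32_t div9(uint32_t x) { return magic_div(x, 0x38E38E39u, 1, 0); }

/* The remainder follows from the quotient with one multiply and one subtract. */
static uint32_t mod9(uint32_t x) { return x - 9u * div9(x); }
```

The divisor 7 is the case that needs f = 1: its exact multiplier is 2^32 + 0x24924925, one bit too wide for a 32-bit register, which is what the halved (x - y) term makes up for.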

guga · July 25, 2023, 12:28:15 AM

Quote from: daydreamer on July 24, 2023, 06:35:46 PM
If you convert to 65536 units per circle instead of 360, then and ebx,0fffh is faster than modulo.
And you can use ebx to point into an atan LUT.

Hi daydreamer, for normal x86 (integers) it is possible, but can we do the same for SSE2? I mean, using magic number division as I explained above?

HSE · July 25, 2023, 12:40:03 AM

Quote from: guga on July 25, 2023, 12:10:57 AM
But none of this means that we cannot calculate equivalent angles of huge numbers

How do you know that the number has zeros after the represented part? Because in big numbers the angle depends on the part that is not there.

Quote from: raymond on July 24, 2023, 02:02:38 AM
I know that I'm probably wasting my time, BUT

jj2007 · July 25, 2023, 01:27:18 AM

Quote from: guga on July 25, 2023, 12:28:15 AM
Quote from: daydreamer on July 24, 2023, 06:35:46 PM
If you convert to 65536 units per circle instead of 360, then and ebx,0fffh is faster than modulo.
And you can use ebx to point into an atan LUT.

Hi daydreamer, for normal x86 (integers) it is possible, but can we do the same for SSE2? I mean, using magic number division as I explained above?

Almost everything can be done with SIMD instructions, but it's better to test it ;-)

Code Select

Using the FPU:
8.704100e+02 = 150.4
1.234560e+06 = 120.0
1.234560e+07 = 120.0
1.234560e+08 = 120.0
1.234560e+09 = 120.0
1.234560e+10 = 120.0

Using SIMD instructions, mode 1 (roundsd):
8.704100e+02 = 150.4
1.234560e+06 = 120.0
1.234560e+07 = 120.0
1.234560e+08 = 120.1
1.234560e+09 = 121.0
1.234560e+10 = 129.9

Using SIMD instructions, mode 2 (and 65536):
8.704100e+02 = 150.4
1.234560e+06 = 120.0
1.234560e+07 = 0
1.234560e+08 = 0
1.234560e+09 = 0
1.234560e+10 = 0
