Atan2 SSE2

guga · February 11, 2022, 04:04:10 AM

Hi JJ.
Many Tks. I´ll take a look

(Post edited), i didn´t saw the file you uploaded when i was writing this comment.

guga · February 11, 2022, 04:41:31 AM

My results with the new version. A bit faster and accurated


31 us on average for creating FastMath
AMD Ryzen 5 2400G with Radeon Vega Graphics     (SSE4)

5123    cycles for 100 * SSE_Atan2D_new
67818   cycles for 100 * ucrtbase atan2
14832   cycles for 100 * fpatan
1479    cycles for 100 * FastAtan

5166    cycles for 100 * SSE_Atan2D_new
67255   cycles for 100 * ucrtbase atan2
14665   cycles for 100 * fpatan
1523    cycles for 100 * FastAtan

5029    cycles for 100 * SSE_Atan2D_new
68465   cycles for 100 * ucrtbase atan2
15228   cycles for 100 * fpatan
1766    cycles for 100 * FastAtan

5076    cycles for 100 * SSE_Atan2D_new
67520   cycles for 100 * ucrtbase atan2
14604   cycles for 100 * fpatan
1568    cycles for 100 * FastAtan

49      bytes for SSE_Atan2D_new
51      bytes for ucrtbase atan2
30      bytes for fpatan
193     bytes for FastAtan

Real8   -56.30993247402021495   SSE_Atan2D_new
Real8   -56.30993247402021495   ucrtbase atan2
Real8   -56.30993247402021495   fpatan
Real8   -56.30993247402021495   FastAtan

Btw, i´m impressed about the results of FastAtan. I analysed it in the disassembler and got a bit clueless how it is showing the correct result since it uses only 1 parameter (value: 33) and not the other one. The FastMath library is being applied before the computations, right ? Because at

Code Select


.text:00401201 UsedInFastAtan:                         ; CODE XREF: MainRoutine+418↓j
.text:00401201                 fld     tbyte_408068 <- 0.22 =  22/100 (Our 1st parameter Y) ?
.text:00401207                 fld     tbyte_409992
.text:0040120D                 faddp   st(1), st
.text:0040120F                 fstp    tbyte_409992
.text:00401215                 push    ecx
.text:00401216                 fxsave  dword_409B50
.text:0040121D                 push    0               ; dwMilliseconds
.text:0040121F                 call    Sleep
.text:00401224                 push    offset PerformanceCount ; lpPerformanceCount
.text:00401229                 call    QueryPerformanceCounter
.text:0040122E                 fxrstor dword_409B50
.text:00401235                 pop     ecx
.text:00401236                 push    ecx
.text:00401237                 push    edi
.text:00401238                 push    0C3500h         ; dwBytes
.text:0040123D                 push    4               ; dwFlags
.text:0040123F                 push    hHeap           ; hHeap
.text:00401245                 call    HeapAlloc
.text:0040124A                 mov     edi, eax
.text:0040124C                 mov     dword_4098E0, edi
.text:00401252                 push    edi
.text:00401253                 add     edi, 61A80h
.text:00401259                 push    edi
.text:0040125A                 ffree   st(7)
.text:0040125C                 fld     tbyte_40807C
.text:00401262                 ffree   st(7)
.text:00401264                 fld     st
.text:00401266                 fstp    tbyte_409988
.text:0040126C                 fld     tbyte_408072
.text:00401272                 fsubrp  st(1), st
.text:00401274                 fld     tbyte_408086
.text:0040127A                 fdivp   st(1), st
.text:0040127C                 mov     edx, offset dword_4099A4
.text:00401281                 fmul    dbl_408090
.text:00401287                 fisttp  dword ptr [edx]
.text:00401289                 fld     tbyte_408086
.text:0040128F                 fimul   dword ptr [edx]
.text:00401291                 fld     tbyte_409988
.text:00401297                 faddp   st(1), st
.text:00401299                 fstp    tbyte_4099FA
.text:0040129F                 cmp     dword_4099A4, 0
.text:004012A6                 js      short loc_4012EA
.text:004012A8
.text:004012A8 loc_4012A8:                             ; CODE XREF: MainRoutine+2C4↓j
.text:004012A8                 fld     tbyte_409988
.text:004012AE                 fstp    tbyte ptr [edi]
.text:004012B0                 fld     tbyte_409988
.text:004012B6                 fld     tbyte_409992
.text:004012BC                 fpatan <---------------Prwcalculated

if i understood correctly, all the computation also uses fpatan opcode, but it is precalculated before the calculations of the timming. So, for the resultant timming be more acccurated, wouldn´t be necessary to add to the final timming of "void FastAtan(-33.0)", the timming it took to precalculate the data. I.e: the fpatan timming as well ?

guga · February 11, 2022, 05:59:40 AM

About the usage. A faster Atan2 is needed to compute more data related to a FFT from audio samples. I'm testing Marinus FirAnalyser and FFTAnalyser in order to create a app that can export ta audio filtered with Fir algorithm or other sort of algorithm that uses FFT too.

http://masm32.com/board/index.php?topic=4231.135
http://masm32.com/board/index.php?topic=3568.msg39209#msg39209
http://masm32.com/board/index.php?topic=3665.15

I´m currently trying to finish a routine to save/export an audio filtered with Marinu´s Fir Algorithm. Later i´ll try to do a similar spectrogram as he did using FFT routines, but instead only calculating the Real or Imaginary frequencies, i wanted to export everything related to an audio file/sample, such as amplitude, phase (that uses atan2 function), pitch identification, Sound Intensity, Sound Power level and finally estimate the distance of the sound and the microphone that recorded it and created the sample (estimating the distance of each frequency can be handy to isolate them or identify backgrounds etc etc).

So, input = audio sample
Output = A structure containing everything related to that sample (or sequence of samples).

Doing that, with all information related to a certain audio file, we can build better tools to identify pitch, noise, isolating vocals etc etc using techniques similar to facebook demucs, or facebook wave2text etc etc. Or also creating a tool that can act as a photoshop, but for audio. example, once we know the frequencies the generated data is placed on a Spectrogram (such as the one existent in Isotope RX using Mel Scale), then we can simply work on the audio file on the same way we do for images. We can identify vocals using only the generated image rather then the heavy computations that involves audio etc

His FFT analyzer has a nice spectrogram but i wanted something like this

Isotope is an extraordinary audio editing tool, but lack some features, such as phase identification, and sound separation (such as facebook´s one - demucs) etc etc

jj2007 · February 11, 2022, 06:26:54 AM

Quote from: guga on February 11, 2022, 04:41:31 AMif i understood correctly, all the computation also uses fpatan opcode, but it is precalculated before the calculations of the timming. So, for the resultant timming be more acccurated, wouldn´t be necessary to add to the final timming of "void FastAtan(-33.0)", the timming it took to precalculate the data. I.e: the fpatan timming as well ?

"31 µs on average for creating FastMath"

That's why I asked (repeatedly) what your requirements are. If there are two nested loops, with X as the first and Y as the second counter, then the outer loop could re-create the FastMath table (at 12µs on my, 30µs on yours), while the inner loop just uses it.

guga · February 15, 2022, 10:27:40 AM

Hi Guys

I succeeded to find the mathematical equation behind the Atan2 function. It uses a variation of a Taylor Series on the 4th degree, and seems to use some fractions derived from a polynomial calculation (Chebyshev ?)

The polynomial used is this:

0.9980475428149416 * (Y^9)/9 - 0.9999987497055115 * (Y^7)/7 + 0.9999999997101690 * (Y^5)/5 - 0.9999999999999822 * (Y^3)/3 + Y

Where Y = A value resultant from the division of xmm0/xmm4

The divisor is calculated as:
-k + (1 + k^2)/(k + z)

where
k = X/Y = Inputed XValue / Inputed YValue
z = The position on the table that decrements each 16 values. This "position" is, in fact a arctangent of a value starting from 32 to 0.029296875000 (From bottom to top on the table)

I´m making some tests using JJ´s technique he did sometime ago on the binary search algo.

I´ll later try to make further tests before implement a newer routine to check for speed. So, far, i succeeded to remove several repeated usage of xmm registers. I´ll try to limit the number of registers, once i suceed to decode this Taylor series and see if it will work on all situations.

guga · February 20, 2022, 01:37:11 AM

Hi Guys

I succeeded to simplify the math , but i have no idea, why the original version is faster then the one i´m testing. Since it is using less registers and i removed some opcodes, it should be faster. But, for some reason, the original one, is achieving around 5000 clocks only, while the other version i´m testing has only 6500 clocks, even using less registers. Both results are ok, but, what i´m doing wrong ? Why this new version is slower even using less registers and less opcodes ? The only difference in the new version is on a routine, i called Sse2_atan2_ComputePolynomial. Is there a cache issue in SSE2 registers that i´m, not aware ?

Btw: I also tested with JJ´s Binary search routine, but it was unsuccessful. The speed results is the same as he version that does not use the loop.

The original version of thee routine is this:

Code Select


Proc Sse2_atan2_ComputePolynomial:
    Arguments @pY, @pX
    Local @StartBlockIndexX, @StartBlockIndexY, @MaskedIndex
    Uses ecx

    mov eax D@pY | movupd xmm0 X$eax; TestingAtan2Data1
    mov eax D@pX | movupd xmm1 X$eax; TestingAtan2Data2

    unpcklpd xmm0 xmm1 ; xmm1 = SSEAtanTmpVal2
    movmskpd eax xmm0 | add eax eax | mov D@MaskedIndex eax

    xorpd xmm3 xmm3
    movupd xmm5 X$Atan2_DivPower895 ; 1/2^895 same as: xorpd xmm5 xmm5 | mov eax 0800 | pinsrw xmm5 eax 3
    paddw xmm5 xmm1
    psrlq xmm5 29
    rcpss xmm3 xmm5
    psllq xmm3 29

    paddw xmm3 X$Atan2_DivPower127 ; xmm3+(1/(2^127)) Warning. The data at this variable must be aligned to 16 bytes
    mulsd xmm3 xmm0 ; Multiply this result with the unpacked data from SSEAtanTmpVal2

    paddd xmm3 X$Atan2_Mask4
    andpd xmm3 X$Atan2_Mask3 ;  get the value of our table = tan(ValueUnknown) = 1.125 etc

    movsd xmm5 xmm3 ; xmm5 used to get the index for out tangent table
    mov eax 08000 | pextrw eax xmm3 3 | and eax 07FF0 | sub eax 03F9E | mov D@StartBlockIndexX eax | add eax 942 | mov D@StartBlockIndexY eax


    minsd xmm3 X$Atan2_Float_32

    SSE_ABS_REAL8 xmm0 ; Convert to positive. This macro is the same as below

    ;psllq xmm0 1
    ;psrlq xmm0 1    ; sign bit of pY was removed (IF any)

    cmplesd xmm5 X$Atan2_Float_32

    SSE_ABS_REAL8 xmm1 ; Convert to positive. This macro is the same as below

;    psllq xmm1 1    ; sign bit of pX was removed (IF any)
;    psrlq xmm1 1 ; contains the value of pX TESTINGATAN2DATA2

    movsd xmm6 xmm1 ; xmm6 = pX
    movsd xmm7 xmm1 ; xmm7 = pX

    movsd xmm4 xmm0 | mulsd xmm4 xmm3 | andpd xmm1 xmm5 | addsd xmm4 xmm1

    mov ecx 0 | pinsrw xmm6 ecx 0 | subsd xmm7 xmm6
    mulsd xmm6 xmm3 | mulsd xmm7 xmm3 | andpd xmm0 xmm5 | subsd xmm0 xmm6
    subsd xmm0 xmm7

    ; TangentTbl2
    .If D@StartBlockIndexX <= 1121 ; used xmm4, xmm0, xmm5, xmm3

        ; Get tangent index and put in ecx
        mov eax D@StartBlockIndexX | pextrw eax xmm5 0 | not eax | and eax 1
        mov ecx D@StartBlockIndexY | pextrw ecx xmm3 3 | sub ecx 03F9E | add ecx eax | add ecx ecx
        movupd xmm5 X$TangentTbl+ecx*8

        mov eax D@MaskedIndex

        movupd xmm6 X$Atan2_P_TBL+eax*8
        movupd xmm1 X$Atan2_SGN_TBL+eax*8
        xorpd xmm5 xmm1 | addpd xmm5 xmm6

        movsd xmm6 xmm5
        unpckhpd xmm5 xmm5

        divsd xmm0 xmm4 ; we are dividing pX/pY if abs(px) < abs(py)
        ; And do the polynomial routine. A Taylor series of 4th degree of the tangent math
        xorpd xmm1 xmm0
        movsd xmm4 xmm1
        mulsd xmm0 xmm0 ; Y = Y^2

        movlpd xmm2 X$Atan2_FactorA9 ; FactorA9
        mulsd xmm2 xmm0 ; A9*Y = FactorA9*Y

        movlpd xmm3 X$Atan2_Limit3
        addsd xmm3 xmm0 ; Atan2_Limit3+Y
        addsd xmm1 xmm6
        subsd xmm6 xmm1
        addsd xmm6 xmm4
        addsd xmm2 X$Atan2_Limit2 ; FactorA9*Y+Atan2_Limit2
        mulsd xmm3 xmm0 ; (Atan2_Limit3+Y)*Y
        mulsd xmm4 xmm0
        addsd xmm6 xmm5
        mulsd xmm2 xmm4
        addsd xmm3 X$Atan2_Limit4 ;  Atan2_Limit4 + (Atan2_Limit3+Y)*Y
        mulsd xmm2 xmm3
        addsd xmm2 xmm6
        addsd xmm1 xmm2 ; xmm1 = -HalfPi, HalfPi or 0
        movupd xmm0 xmm1

    .Else_If D@StartBlockIndexY <= 942;03AE ; Simple_Polynomial used xmm1, xmm3, xmm6, xmm0

        mov eax D@pY | movupd xmm2 X$eax | SSE_ABS_REAL8 xmm2

        movupd xmm4 X$Atan2_Float_One ; same as: xorpd xmm4 xmm4 | mov ecx 03FF0 | pinsrw xmm4 ecx 3
        divsd xmm4 xmm1

        mov eax D@MaskedIndex
        movupd xmm6 X$Atan2_SGN_TBL+eax*8
        unpcklpd xmm3 xmm3
        xorpd xmm0 xmm6
        xorpd xmm2 xmm6
        xorpd xmm3 xmm6
        movupd xmm7 X$Atan2_P_TBL2+eax*8
        movlpd xmm1 X$Atan2_FactorA9
        movlpd xmm5 X$Atan2_Limit3
        andpd xmm3 X$Atan2_SELECT_B+eax*8
        mulsd xmm2 xmm4
        mulsd xmm0 xmm4
        movsd xmm6 xmm2
        mulsd xmm2 xmm2
        mulsd xmm1 xmm2
        addsd xmm5 xmm2
        mulsd xmm6 xmm2
        addsd xmm1 X$Atan2_Limit2
        mulsd xmm5 xmm2
        addsd xmm7 xmm0
        addpd xmm7 xmm3
        mulsd xmm1 xmm6
        addsd xmm5 X$Atan2_Limit4
        mulsd xmm5 xmm1
        addsd xmm5 xmm7
        unpckhpd xmm7 xmm7
        addsd xmm5 xmm7
        movupd xmm0 xmm5

    .Else ; Big_or_Small
        call Sse2_atan2_Big_or_Small
    .End_If

EndP

Newer version is:

Code Select


Proc Sse2_atan2_ComputePolynomial:
    Arguments @pY, @pX
    Local @StartBlockIndexX, @StartBlockIndexY, @PosInTable
    Uses esi, ecx

    movupd xmm1 X$Atan2_DivPower895 ; 1/2^895 same as: xorpd xmm5 xmm5 | mov eax 0800 | pinsrw xmm5 eax 3
    mov eax D@pX | movupd xmm0 X$eax
    paddw xmm1 xmm0
    psrlq xmm1 29
    xorpd xmm2 xmm2
    rcpss xmm2 xmm1
    psllq xmm2 29

    paddw xmm2 X$Atan2_DivPower127 ; xmm3+(1/(2^127)) Warning. The data at this variable must be aligned to 16 bytes
    mov eax D@pY | movupd xmm0 X$eax
    mulsd xmm2 xmm0 ; Multiply this result with the unpacked data from SSEAtanTmpVal2

    paddd xmm2 X$Atan2_Mask4
    andpd xmm2 X$Atan2_Mask3 ;  get the value of our table = tan(ValueUnknown) = 1.125 etc

    movsd xmm1 xmm2 ; xmm5 used to get the index for out tangent table
    pextrw eax xmm2 3 | and eax 07FF0 | sub eax 16286 | mov D@StartBlockIndexX eax | add eax 942 | mov D@StartBlockIndexY eax

    minsd xmm2 X$Atan2_Float_32

    cmplesd xmm1 X$Atan2_Float_32

    .If D@StartBlockIndexX <= 1121
        ; Get tangent index and put in ecx

        pextrw eax xmm1 0 | not eax | and eax 1
        pextrw ecx xmm2 3 | sub ecx 16286 | add ecx eax | add ecx ecx | mov D@PosInTable ecx

        ; we already have Y in xmm0, so, let´s put X on xmm1
        mov eax D@pX | movupd xmm1 X$eax

        ; get our masked index
        unpcklpd xmm0 xmm1 | movmskpd ecx xmm0 | add ecx ecx; | mov D@MaskedIndex eax

        SSE_ABS_REAL8 xmm0 ; Turn Y in Positive
        SSE_ABS_REAL8 xmm1 ; Turn X in Positive
        ; xmm2 already have our tangent pos index
        pshufd xmm2 xmm2 {SHUFFLE 1,0,1,0} ; we are simply coying from LowQuadrord to Hi Quaddword
        mulsd xmm2 xmm0 ; Y*z
        addsd xmm2 xmm1 ; X+Y*z
        pshufd xmm2 xmm2 SSE_SWAP_QWORDS ; swap (X + Y z)in xmm2, to we reuse it
        mulsd xmm1 xmm2; X*z
        subsd xmm0 xmm1; Y - X z
        movhlps xmm1 xmm2 ; copy (X + Y z) in the hiqword of xmm2 to low Qword in xmm1
        divsd xmm0 xmm1 ; finally get our divisor

        ; ----------------------- Start Calculating Polynomial - Taylor Series of Atan
        pshufd XMM0 XMM0 {SHUFFLE 1,0,1,0} ; we are simply coying from LowQuadrord to Hi Quaddword =    PSHUFD XMM0 XMM0 044h
        mulsd xmm0 xmm0 ; the lowquadword will be a power of our divisor Y^4

        movsd xmm1 xmm0 ; Copy the double Y to xmm1
        ; 1St Step. Calculate C = (0.1108941714238824 Y^2 - 0.1428569642436445)*Y^2
        mulsd xmm1 X$Atan2_FactorA9
        addsd xmm1 X$Atan2PolynomialFactor1A
        mulsd xmm1 xmm0

        ; 2nd step. Calculate B = Y^2*(C+0.1999999999420338)
        addsd xmm1 X$Atan2PolynomialFactor2A
        mulsd xmm1 xmm0

        ; 3rd step. Calculate A = Y^2*(B - 0.3333333333333274)
        addsd xmm1 X$Atan2PolynomialFactor3A
        mulsd xmm1 xmm0

        ; 4th step. Calculate Y*(A+1)
        pshufd xmm0 xmm0 SSE_SWAP_QWORDS ; swap Y <-- Y^2 in xmm0 =  PSHUFD XMM0 XMM0 78 (decimal)
        addsd xmm1 X$Atan2_Float_One
        mulsd xmm0 xmm1 ; result in xmm0

        mov eax D@PosInTable
        addsd xmm0 X$TangentTbl+eax*8+TangenTbl.MainValueDis
        addsd xmm0 X$TangentTbl+eax*8+TangenTbl.RemainderValueDis

        ; Now we can safelly apply our mask TestingAtan2Data1
        xorpd xmm0 X$Atan2_SGN_TBLA+ecx*8 | addsd xmm0 X$Atan2_P_TBLA+ecx*8 | addsd xmm0 X$Atan2_P_TBLA+ecx*8+8


    .Else_If D@StartBlockIndexY <= 942

        movsd xmm2 X$Atan2ClosetoZeroIndex ;esi+TangenTbl.TangPosDis = 0.029296875 (Fixed value when the value is closer to zero or near Pi)
        ; got our true pointer

        ; No need for index here. We need only to divide x/y
        ; we already have Y in xmm0, so, let´s put X on xmm1
        movups xmm1 X$Atan2_Float_One
        mov eax D@pX | movupd xmm1 X$eax ; Get X and copy it to xmm0
        SSE_ABS_REAL8 xmm0 ; Turn Y in Positive
        SSE_ABS_REAL8 xmm1 ; Turn X in Positive
        divsd xmm0 xmm1

        ; get our masked index
        unpcklpd xmm0 xmm1 | movmskpd ecx xmm0 | add ecx ecx; | mov D@MaskedIndex eax

        ; ----------------------- Start Calculating Polynomial - Taylor Series of Atan
        pshufd XMM0 XMM0 {SHUFFLE 1,0,1,0} ; we are simply coying from LowQuadrord to Hi Quaddword
        mulsd xmm0 xmm0 ; the lowquadword will be a power of our divisor Y^4
        ; -9.05387052965142591e-4   8.19725715676905915381763684962193281e-7

        movsd xmm1 xmm0 ; Copy the double Y to xmm1
        ; 1St Step. Calculate C = (0.1108941714238824 Y^2 - 0.1428569642436445)*Y^2
        mulsd xmm1 X$Atan2_FactorA9 | addsd xmm1 X$Atan2PolynomialFactor1A | mulsd xmm1 xmm0

        ; 2nd step. Calculate B = Y^2*(C+0.1999999999420338)
        addsd xmm1 X$Atan2PolynomialFactor2A | mulsd xmm1 xmm0

        ; 3rd step. Calculate A = Y^2*(B - 0.3333333333333274)
        addsd xmm1 X$Atan2PolynomialFactor3A | mulsd xmm1 xmm0

        ; 4th step. Calculate Y*(A+1)
        pshufd xmm0 xmm0 SSE_SWAP_QWORDS ; swap Y <-- Y^2 in xmm0
        addsd xmm1 X$Atan2_Float_One | mulsd xmm0 xmm1 ; result in xmm0

        xorpd xmm0 X$Atan2_SGN_TBLA+ecx*8 | addsd xmm0 X$Atan2_P_TBLA+ecx*8 | addsd xmm0 X$Atan2_P_TBLA+ecx*8+8


    .Else ; Big_or_Small
        call Sse2_atan2_Big_or_Small
    .End_If

EndP

jj2007 · February 20, 2022, 10:57:27 AM

Hi Guga,

I'm a bit lost - what do you mean with "JJ´s Binary search routine"?

guga · February 20, 2022, 06:08:31 PM

Quote from: jj2007 on February 20, 2022, 10:57:27 AM
Hi Guga,

I'm a bit lost - what do you mean with "JJ´s Binary search routine"?

Hi JJ

This one

http://masm32.com/board/index.php?topic=7659.msg83723#msg83723

You helped some time ago on a image routine i was testing.

jj2007 · February 20, 2022, 09:00:15 PM

Quote from: guga on February 20, 2022, 06:08:31 PM
Quote from: jj2007 on February 20, 2022, 10:57:27 AM
Hi Guga,

I'm a bit lost - what do you mean with "JJ´s Binary search routine"?

Hi JJ

This one

http://masm32.com/board/index.php?topic=7659.msg83723#msg83723

You helped some time ago on a image routine i was testing.

Yes, now I remember. It's called ArrayIndex now.

HSE · February 21, 2022, 12:05:14 AM

Hi Guga!

Quote from: guga on February 20, 2022, 01:37:11 AM
Since it is using less registers and i removed some opcodes, it should be faster.

That is a bad assumption!

See instruccion timing.

Regards, HSE.

guga · February 23, 2022, 01:00:03 PM

Quote from: HSE on February 21, 2022, 12:05:14 AM
Hi Guga!

Quote from: guga on February 20, 2022, 01:37:11 AM
Since it is using less registers and i removed some opcodes, it should be faster.

That is a bad assumption!

See instruccion timing.

Regards, HSE.

Indeed. You are right. Tks, HSE

I´ll need to stay with the original version instead. Or...another solution is use the LolRemez polynomials to speed it up a bit, but i´m not sure about accuracy yet. I´ll give a try converting aa app made in FreeBasic that do this LolRemez polynomials to be studied. The only problem is that i´ll also need to try to create a dll for FreeBasic to this thing works properly (Ouch.). If i see that the FreeBasic converstion will take too many time, i´ll give up and try to create a Lolremez app by scratch to test the accuracy for the algorithm. Btw..a LolRemez version for masm (or RosAsm, Fasm, nasm etc) is utterly needed for we test other ways to create faster routines that uses complex math.

jj2007 · February 23, 2022, 10:53:37 PM

Quote from: guga on February 23, 2022, 01:00:03 PMcreate faster routines that uses complex math.

If you give me examples what you need, I can try with FastMath().

guga · March 02, 2022, 08:23:25 AM

Hu JJ

I don´t remember right now any working example. I´ll take a look later, ok ? The goal is create a tool that allow us to do complex calculations on a faster way using this lolremez technique (or other) that can result on approximation values, like the ones we discussed some time ago with Siekmansk at: http://masm32.com/board/index.php?topic=8734.msg95511#msg95511

So, find the polynomials values for Exp(x) = Exp2(x / Logn(2.0)) etc or more complex maths like:

(x^2/log(y)) + sin(x)*atan2(y) etc etc

With lolremez we can produce things like:

Code Select


lolremez -d 10 -r 0:1 "2**x" "2**x"

// Approximation of f(x) = 2**x
// with weight function g(x) = 2**x
// on interval [ 0, 1 ]
// with a polynomial of degree 10.
double f(double x)
{
    double u = 9.9522144894077596e-9;
    u = u * x + 9.4609455637620191e-8;
    u = u * x + 1.331271998884894e-6;
    u = u * x + 1.5244659656760603e-5;
    u = u * x + 1.5403957490414841e-4;
    u = u * x + 1.3333543730974749e-3;
    u = u * x + 9.6181294091749148e-3;
    u = u * x + 5.5504108628244985e-2;
    u = u * x + 2.402265069613608e-1;
    u = u * x + 6.9314718055989127e-1;
    return u * x + 1.0000000000000002;
}

.const
Chebylog2E      real4 4 dup (1.44269504089) ; 1/Logn(2.0)

Cheby10Exp0  	real4 4 dup (9.9522144894077596e-9)
Cheby10Exp1  	real4 4 dup (9.4609455637620191e-8)
Cheby10Exp2  	real4 4 dup (1.331271998884894e-6)
Cheby10Exp3  	real4 4 dup (1.5244659656760603e-5)
Cheby10Exp4  	real4 4 dup (1.5403957490414841e-4)
Cheby10Exp5  	real4 4 dup (1.3333543730974749e-3)
Cheby10Exp6  	real4 4 dup (9.6181294091749148e-3)
Cheby10Exp7  	real4 4 dup (5.5504108628244985e-2)
Cheby10Exp8  	real4 4 dup (2.402265069613608e-1)
Cheby10Exp9  	real4 4 dup (6.9314718055989127e-1)
Cheby10Exp10  	real4 4 dup (1.0000000000000002)


.code
align 16
SSE2_Exp_10:  ; in: xmm0, out: xmm1
    movaps      xmm2,oword ptr Chebylog2E
    mulps       xmm2,xmm0
    psrld       xmm0,31
    cvttps2dq   xmm1,xmm2
    psubd       xmm1,xmm0
    movdqa      xmm0,xmm1
    cvtdq2ps    xmm1,xmm1
    subps       xmm2,xmm1
    movaps      xmm1,oword ptr Cheby10Exp0
    pslld       xmm0,23
    mulps       xmm1,xmm2
    addps       xmm1,oword ptr Cheby10Exp1
    mulps       xmm1,xmm2
    addps       xmm1,oword ptr Cheby10Exp2
    mulps       xmm1,xmm2
    addps       xmm1,oword ptr Cheby10Exp3
    mulps       xmm1,xmm2
    addps       xmm1,oword ptr Cheby10Exp4
    mulps       xmm1,xmm2
    addps       xmm1,oword ptr Cheby10Exp5
    mulps       xmm1,xmm2
    addps       xmm1,oword ptr Cheby10Exp6
    mulps       xmm1,xmm2
    addps       xmm1,oword ptr Cheby10Exp7
    mulps       xmm1,xmm2
    addps       xmm1,oword ptr Cheby10Exp8
    mulps       xmm1,xmm2
    addps       xmm1,oword ptr Cheby10Exp9
    mulps       xmm1,xmm2
    addps       xmm1,oword ptr Cheby10Exp10
    paddd       xmm0,xmm1
    ret

So, a app where we can input a math formula and result the full code with the polynomials already computed can be very handy

jj2007 · March 02, 2022, 08:29:15 AM

If you can bring (x^2/log(y)) + sin(x)*atan2(y) into the form y=f(x), I can help you (I'm not good at math)

guga · March 02, 2022, 09:09:07 AM

Let´s try it on a more simple function , then.

Try this:

sin(x)+cos(x)

which is the same as:

sqrt(2) * sin(x + Pi/4)

On this example, LolRemez app returned:

Code Select


G:\HD1_Fino\BackupHD IDE - ok\RosAsm\RosAsm\Development\Dlls\New\LolRemezOld>lolremez.exe sin(x)+cos(x)
// Approximation of f(x) = sin(x)+cos(x)
// on interval [ -1, 1 ]
// with a polynomial of degree 4.
// p(x)=(((3.9297499716796932e-2*x-1.5653249505225839e-1)*x-4.9891262416895649e-1)*x+9.9750048049840423e-1)*x+9.9991743032029927e-1
double f(double x)
{
    double u = 3.9297499716796932e-2;
    u = u * x + -1.5653249505225839e-1;
    u = u * x + -4.9891262416895649e-1;
    u = u * x + 9.9750048049840423e-1;
    return u * x + 9.9991743032029927e-1;
}

G:\HD1_Fino\BackupHD IDE - ok\RosAsm\RosAsm\Development\Dlls\New\LolRemezOld>pause
Press any key to continue . . .

The MASM Forum

News:

Atan2 SSE2

guga

guga

guga

jj2007

guga

guga

jj2007

guga

jj2007

HSE

guga

jj2007

guga

jj2007

guga