Trigonometry ...

Started by rrr314159, March 26, 2015, 04:48:39 PM

FORTRANS

Hi,

   Three sets of results.  Given that these are all Intel, the FPU
results look odd.

3210 cycles fsin (FPU)
7194 cycles fsin (FPU)
17200 cycles fsin (FPU)

  Oh well, good to find out I suppose.

(Old Laptop)

Vendor String: GenuineIntel
Brand  String: Intel(R)Pentium(R)Mprocessor1.70GHz

SSE2 Sin Cos Tan routines by Siekmanski 2015.

fsin:    0.19866933 ( FPU )
SSE2sin: 0.19866522 ( SSE2 lut )
FPUsin:  0.19866522 ( FPU lut )

fcos:    0.98006658 ( FPU )
SSE2cos: 0.98005313 ( SSE2 lut )
FPUcos:  0.98005313 ( FPU lut )

ftan:    0.20271004 ( FPU )
SSE2tan: 0.20270567 ( SSE2 lut )
FPUtan:  0.20270567 ( FPU lut )

Start timing, please wait for 5 seconds......

470  cycles SSE2sin Quadrant 1
512  cycles SSE2sin Quadrant 4

551  cycles FPUsin Quadrant 1
583  cycles FPUsin Quadrant 4

623  cycles FPUsin from SSE to FPU to SSE Quadrant 1
642  cycles FPUsin from SSE to FPU to SSE Quadrant 4

491  cycles SSE2cos Quadrant 1
524  cycles SSE2cos Quadrant 4

574  cycles FPUcos Quadrant 1
594  cycles FPUcos Quadrant 4

636  cycles FPUcos from SSE to FPU to SSE Quadrant 1
657  cycles FPUcos from SSE to FPU to SSE Quadrant 4

471  cycles SSE2tan Quadrant 1
511  cycles SSE2tan Quadrant 2

552  cycles FPUtan Quadrant 1
553  cycles FPUtan Quadrant 2

623  cycles FPUtan from SSE to FPU to SSE Quadrant 1
616  cycles FPUtan from SSE to FPU to SSE Quadrant 2


Start FPU timing... please wait, this may take a while ......

3210 cycles fsin (FPU)
3202 cycles fcos (FPU)
3187 cycles ftan (FPU)

Press any key to continue...

==================
(Newer Laptop)

Vendor String: GenuineIntel
Brand  String: Intel(R)Core(TM)i3-4005UCPU@1.70GHz

SSE2 Sin Cos Tan routines by Siekmanski 2015.

fsin:    0.19866933 ( FPU )
SSE2sin: 0.19866522 ( SSE2 lut )
FPUsin:  0.19866522 ( FPU lut )

fcos:    0.98006658 ( FPU )
SSE2cos: 0.98005313 ( SSE2 lut )
FPUcos:  0.98005313 ( FPU lut )

ftan:    0.20271004 ( FPU )
SSE2tan: 0.20270567 ( SSE2 lut )
FPUtan:  0.20270567 ( FPU lut )

Start timing, please wait for 5 seconds......

185  cycles SSE2sin Quadrant 1
210  cycles SSE2sin Quadrant 4

178  cycles FPUsin Quadrant 1
238  cycles FPUsin Quadrant 4

148  cycles FPUsin from SSE to FPU to SSE Quadrant 1
192  cycles FPUsin from SSE to FPU to SSE Quadrant 4

185  cycles SSE2cos Quadrant 1
182  cycles SSE2cos Quadrant 4

134  cycles FPUcos Quadrant 1
167  cycles FPUcos Quadrant 4

161  cycles FPUcos from SSE to FPU to SSE Quadrant 1
205  cycles FPUcos from SSE to FPU to SSE Quadrant 4

185  cycles SSE2tan Quadrant 1
185  cycles SSE2tan Quadrant 2

117  cycles FPUtan Quadrant 1
136  cycles FPUtan Quadrant 2

142  cycles FPUtan from SSE to FPU to SSE Quadrant 1
167  cycles FPUtan from SSE to FPU to SSE Quadrant 2


Start FPU timing... please wait, this may take a while ......

7194 cycles fsin (FPU)
7209 cycles fcos (FPU)
7413 cycles ftan (FPU)

Press any key to continue...

==================
(Loaner?)
Vendor String: GenuineIntel
Brand  String: Intel(R)Pentium(R)4CPU2.40GHz

SSE2 Sin Cos Tan routines by Siekmanski 2015.

fsin:    0.19866933 ( FPU )
SSE2sin: 0.19866522 ( SSE2 lut )
FPUsin:  0.19866522 ( FPU lut )

fcos:    0.98006658 ( FPU )
SSE2cos: 0.98005313 ( SSE2 lut )
FPUcos:  0.98005313 ( FPU lut )

ftan:    0.20271004 ( FPU )
SSE2tan: 0.20270567 ( SSE2 lut )
FPUtan:  0.20270567 ( FPU lut )

Start timing, please wait for 5 seconds......

418  cycles SSE2sin Quadrant 1
433  cycles SSE2sin Quadrant 4

511  cycles FPUsin Quadrant 1
681  cycles FPUsin Quadrant 4

632  cycles FPUsin from SSE to FPU to SSE Quadrant 1
801  cycles FPUsin from SSE to FPU to SSE Quadrant 4

420  cycles SSE2cos Quadrant 1
713  cycles SSE2cos Quadrant 4

576  cycles FPUcos Quadrant 1
691  cycles FPUcos Quadrant 4

693  cycles FPUcos from SSE to FPU to SSE Quadrant 1
821  cycles FPUcos from SSE to FPU to SSE Quadrant 4

420  cycles SSE2tan Quadrant 1
421  cycles SSE2tan Quadrant 2

514  cycles FPUtan Quadrant 1
552  cycles FPUtan Quadrant 2

637  cycles FPUtan from SSE to FPU to SSE Quadrant 1
670  cycles FPUtan from SSE to FPU to SSE Quadrant 2


Start FPU timing... please wait, this may take a while ......

17200 cycles fsin (FPU)
17116 cycles fcos (FPU)
17985 cycles ftan (FPU)

Press any key to continue...


HTH,

Steve N.

rrr314159

Quote from: siekmanski
Funny that ftan is faster than fsin and fcos on your machine.
Or it must be that fsincos is that fast, because i calculated ftan with fsincos and fdivp.

Yes, I often get strange timings with this AMD; they don't seem to make sense. I bet nidud would get the same odd result. More modern, more powerful AMDs like sinsi's behave more like Intel.

Quote from: FORTRANS on April 04, 2015, 02:58:24 AM
Given that these are all Intel, the FPU results look odd.

There are enough odd results here and in many other threads, no matter how you do the timing or what machine you use, to make one take all these figures with a grain of salt.
I am NaN ;)

TWell

AMD E1-6010 1,35 GHz
SSE2 Sin Cos Tan routines by Siekmanski 2015.

fsin:    0.19866933 ( FPU )
SSE2sin: 0.19866522 ( SSE2 lut )
FPUsin:  0.19866522 ( FPU lut )

fcos:    0.98006658 ( FPU )
SSE2cos: 0.98005313 ( SSE2 lut )
FPUcos:  0.98005313 ( FPU lut )

ftan:    0.20271004 ( FPU )
SSE2tan: 0.20270567 ( SSE2 lut )
FPUtan:  0.20270567 ( FPU lut )

Start timing, please wait for 5 seconds......

277  cycles SSE2sin Quadrant 1
320  cycles SSE2sin Quadrant 4

205  cycles FPUsin Quadrant 1
281  cycles FPUsin Quadrant 4

233  cycles FPUsin from SSE to FPU to SSE Quadrant 1
330  cycles FPUsin from SSE to FPU to SSE Quadrant 4

274  cycles SSE2cos Quadrant 1
329  cycles SSE2cos Quadrant 4

205  cycles FPUcos Quadrant 1
297  cycles FPUcos Quadrant 4

242  cycles FPUcos from SSE to FPU to SSE Quadrant 1
340  cycles FPUcos from SSE to FPU to SSE Quadrant 4

276  cycles SSE2tan Quadrant 1
280  cycles SSE2tan Quadrant 2

204  cycles FPUtan Quadrant 1
276  cycles FPUtan Quadrant 2

235  cycles FPUtan from SSE to FPU to SSE Quadrant 1
302  cycles FPUtan from SSE to FPU to SSE Quadrant 2


Start FPU timing... please wait, this may take a while ......

1080 cycles fsin (FPU)
1117 cycles fcos (FPU)
701 cycles ftan (FPU)

Press any key to continue...

dedndave

ya know - i was thinking the FPU timings shown were awful slow
are you sure there isn't something amiss ?

similar for many FPU trig functions...

The angle must be expressed in radians and be within the -2^63 to +2^63 range.
If the source angle value is outside the acceptable range (but not INFINITY), the C2 flag of the Status Word
is set to 1 and the content of all data registers remains unchanged; the TOP register field of the Status Word
is not modified and no exception is detected.

if that occurs, or if the FPU stack is full (exception?), it would run very slow
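For anyone who wants to rule the range problem out, here is a small Python sketch of the check described above. It assumes the limit is 2^63 in magnitude, which is the figure Intel's documentation gives for fsin, fcos, fsincos and fptan. The function names are made up for illustration, and math.fmod only stands in for whatever reduction (fprem, for example) the real code would use.

```python
import math

# Assumed limit: the FPU trig instructions accept |angle| < 2^63 radians.
FPU_TRIG_LIMIT = 2.0 ** 63

def fpu_would_accept(angle):
    """True if the angle is finite and inside the assumed FPU range."""
    return math.isfinite(angle) and abs(angle) < FPU_TRIG_LIMIT

def reduce_angle(angle):
    """Bring a finite angle into (-2*pi, 2*pi); a stand-in for an fprem loop."""
    return math.fmod(angle, 2.0 * math.pi)

for a in (0.2, -1.0e18, 2.0 ** 63, 1.0e20):
    print(a, fpu_would_accept(a), fpu_would_accept(reduce_angle(a)))
```

The test angle used in the listings (about 0.2 radians) is nowhere near the limit, so this would only matter for code that feeds in unreduced values.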

rrr314159

Hi dedndave,

All siekmanski's timings are for a REPEAT 10 block. So, divide by 10 to get the times per one calculation. My timings (see previous pages) report one calculation, and I get somewhat more than 100 cycles for an FPU sin or cos. Since we're using these numbers only for comparison it doesn't matter as long as it's apples to apples. Many, I think most, laboratory threads follow the same practice; the numbers they're working so hard to minimize are for a REPEAT block of 10 or 100.
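To put the two conventions side by side, the REPEAT-block totals can be normalized to cycles per calculation. A minimal Python sketch, using FORTRANS's three fsin totals from earlier in the thread and the REPEAT count of 10 mentioned above; the function name is hypothetical:

```python
def cycles_per_calc(block_cycles, repeat_count=10):
    """Divide a REPEAT-block total by the number of repeats in the block."""
    return block_cycles / repeat_count

# fsin (FPU) totals as posted by FORTRANS, one per machine.
fsin_totals = {
    "Pentium M 1.70GHz": 3210,
    "Core i3-4005U 1.70GHz": 7194,
    "Pentium 4 2.40GHz": 17200,
}

per_calc = {cpu: cycles_per_calc(total) for cpu, total in fsin_totals.items()}
for cpu, cycles in per_calc.items():
    print(f"{cpu}: {cycles:.1f} cycles per fsin")
```

The division changes nothing about the comparison between routines on one machine; it only makes numbers from different test programs comparable.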

siekmanski or anyone else pls correct me if wrong.

No, it's not overflow or any exception - believe me I know when that happens, have plenty of experience! Most common is when you don't balance FPU stack pushes and pops - when repeating 100,000,000 times, or trying to, the condition is unmistakable.

rrr314159

Here is a trigSSE SIMD version which does two real8's in xmm0. It uses Default Rounding Mode (no more Floor Truncation), and MichaelW's timing macros. I'm working on a real4 version to do 4 at once, and also YMM AVX capability. This version does Cos as well as Sin, but uses Sin Power Series, so Cos is almost a cycle slower. Need to do a Cos Power Series version to speed that up.

I also made a better FPU version, perhaps I'll post it later. Does only one at a time, of course, a little slower per calculation than this SSE, but if you only need one at a time it's faster.

@qWord, that Jean-Michel Muller algorithm was a good start but it's very slow. When reduced to similar precision as trigSSE, about 4 times slower. Can provide details if anyone's interested - bottom line, he knows his numbers, but don't go to him to learn how to program!

As far as I can tell trigSSE is almost down to 1 nanosecond per calculation on my i5: about 4 cycles. I believe these numbers are pretty accurate, though of course it's hard to be sure. Here are timings:

Intel i5 3330
    ----------------------------------------
USING Default Round-to-Nearest
BLANK LOOP total is     162

Test Function: trigSSE Sin ========================================

Cycles for 100 Calcs    472

Precision (HIGHER)
average precision       3.77e-007
worst precision         5.97e-007

Test Function: trigSSE Cos ========================================

Cycles for 100 Calcs    520

Precision (HIGHER)
average precision       3.77e-007
worst precision         5.97e-007

Test Function: trigSSE Sin ========================================

Cycles for 100 Calcs    365

Precision (FASTER)
average precision       4.33e-005
worst precision         6.80e-005

Test Function: trigSSE Cos ========================================

Cycles for 100 Calcs    414

Precision (FASTER)
average precision       4.33e-005
worst precision         6.80e-005

===================================================================
AMD A6
    ----------------------------------------
USING Default Round-to-Nearest
BLANK LOOP total is     162

Test Function: trigSSE Sin ========================================

Cycles for 100 Calcs    933

Precision (HIGHER)
average precision       3.77e-007
worst precision         5.97e-007

Test Function: trigSSE Cos ========================================

Cycles for 100 Calcs    994

Precision (HIGHER)
average precision       3.77e-007
worst precision         5.97e-007

Test Function: trigSSE Sin ========================================

Cycles for 100 Calcs    725

Precision (FASTER)
average precision       4.33e-005
worst precision         6.80e-005

Test Function: trigSSE Cos ========================================

Cycles for 100 Calcs    824

Precision (FASTER)
average precision       4.33e-005
worst precision         6.80e-005


It does two precisions, about 10^-7 and 10^-5; think I can make those a bit better.

Here is the basic algo, see the zip for supporting constants and other details.

;;*******************************
trigSSE MACRO sincosflag:=<0>       ;; SSE SIMD sincosflag = 0 for sin
;;*******************************
    IF sincosflag
        addpd xmm0, op piover2      ;; converts to cos when sincosflag = 1
    ENDIF

    mulpd xmm0, op oneoverpi        ;; intro, convert to -.5 to .499999
    cvtpd2dq xmm1, xmm0
    cvtdq2pd xmm2, xmm1
    subpd xmm0, xmm2
    pslld xmm1, 31
    pshufd xmm1, xmm1, 073h         ;; convert xmm1 int's to neg masks
    xorpd xmm0, xmm1                ;; chs for left hemisphere
    mulpd xmm0, op pi
   
    movapd xmm2,xmm0
    mulpd xmm2,xmm2
    IF MORE_PRECISION
        movapd xmm1, op cheby_3
        mulpd xmm1, xmm2
        addpd xmm1, op cheby_2
        mulpd xmm1, xmm2
        addpd xmm1, op cheby_1
        mulpd xmm1, xmm2
        addpd xmm1, op cheby_0
        mulpd xmm0, xmm1
    ELSE
        movapd xmm1, op ch2
        mulpd xmm1, xmm2
        addpd xmm1, op ch1
        mulpd xmm1, xmm2
        addpd xmm1, op ch0
        mulpd xmm0, xmm1
    ENDIF
ENDM
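For readers who don't want to trace the macro, here is a scalar Python sketch of the same steps: optional pi/2 shift for cos, scale by 1/pi, round to nearest, keep the fraction, flip the sign for odd multiples of pi, scale back, then evaluate an odd polynomial by Horner's rule. The cheby_0 .. cheby_3 constants live in the zip and are not shown in this post, so the sketch substitutes plain Taylor coefficients as stand-ins. Its accuracy (about 1.6e-4 worst case) is therefore NOT the accuracy reported in the listings, and the names are made up for illustration.

```python
import math

# Stand-in coefficients (assumption): Taylor terms of sin(r)/r in powers of r*r.
# The real cheby_0..cheby_3 values are in the zip and are not reproduced here.
STAND_IN = (1.0, -1.0 / 6.0, 1.0 / 120.0, -1.0 / 5040.0)

def trig_sketch(x, cos_flag=False):
    """Scalar walk-through of the trigSSE steps for one value."""
    if cos_flag:
        x += math.pi / 2.0          # cos(x) = sin(x + pi/2)
    t = x * (1.0 / math.pi)         # angle in units of pi
    n = round(t)                    # round to nearest (ties to even), like cvtpd2dq
    f = t - n                       # fraction in [-0.5, 0.5]
    if n & 1:                       # odd multiple of pi: change sign
        f = -f
    r = f * math.pi                 # reduced angle in [-pi/2, pi/2]
    r2 = r * r
    p = STAND_IN[3]                 # Horner's rule in r*r
    p = p * r2 + STAND_IN[2]
    p = p * r2 + STAND_IN[1]
    p = p * r2 + STAND_IN[0]
    return r * p

print(trig_sketch(0.2), math.sin(0.2))
print(trig_sketch(0.2, cos_flag=True), math.cos(0.2))
```

The macro does the sign flip without a branch by shifting the integer's low bit into the sign position and xor-ing it in; the if statement here is just the readable equivalent.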


Zip contains:

trig_macros.asm: the trigSSE macro and FPU macro
trig.asm: controller, simple to use I hope
test_support_macros: the hard part.
trig.exe
answers.out: 400 calc'ed vals with deltas
dotrig.bat: "MAKEFILE" batch file

rrr314159

@TWell, thanks for the contribution. These AMDs are odd: how can fsincos / fdivp be faster than fsin, which itself is almost 10 X faster than Intel? If you ever figure it out, let us know.

TWell

AMD E1-6010 1,35 GHz
    ----------------------------------------
USING Default Round-to-Nearest
BLANK LOOP total is 168

Test Function: trigSSE Sin ========================================

Cycles for 100 Calcs 1221

Precision (HIGHER)
average precision 3.77e-007
worst precision    5.97e-007

Test Function: trigSSE Cos ========================================

Cycles for 100 Calcs 1279

Precision (HIGHER)
average precision 3.77e-007
worst precision    5.97e-007

Test Function: trigSSE Sin ========================================

Cycles for 100 Calcs 928

Precision (FASTER)
average precision 4.33e-005
worst precision    6.80e-005

Test Function: trigSSE Cos ========================================

Cycles for 100 Calcs 1070

Precision (FASTER)
average precision 4.33e-005
worst precision    6.80e-005

Siekmanski

Steve and Tim, thanks for the input.
Dave, timings are for a REPEAT 10 block. Divide by 10 to get the times per one calculation. ( forgot to mention that. )

rrr314159, Here are the timings:

Intel i7-4930K @3.4GHz
    ----------------------------------------
USING Default Round-to-Nearest
BLANK LOOP total is     179

Test Function: trigSSE Sin ========================================

Cycles for 100 Calcs    577

Precision (HIGHER)
average precision       3.77e-007
worst precision         5.97e-007

Test Function: trigSSE Cos ========================================

Cycles for 100 Calcs    657

Precision (HIGHER)
average precision       3.77e-007
worst precision         5.97e-007

Test Function: trigSSE Sin ========================================

Cycles for 100 Calcs    496

Precision (FASTER)
average precision       4.33e-005
worst precision         6.80e-005

Test Function: trigSSE Cos ========================================

Cycles for 100 Calcs    518

Precision (FASTER)
average precision       4.33e-005
worst precision         6.80e-005
Creative coders use backward thinking techniques as a strategy.

Gunther

Hi rrr,

here are the timings:

USING Default Round-to-Nearest
BLANK LOOP total is     142

Test Function: trigSSE Sin ========================================

Cycles for 100 Calcs    454

Precision (HIGHER)
average precision       3.77e-007
worst precision         5.97e-007

Test Function: trigSSE Cos ========================================

Cycles for 100 Calcs    501

Precision (HIGHER)
average precision       3.77e-007
worst precision         5.97e-007

Test Function: trigSSE Sin ========================================

Cycles for 100 Calcs    354

Precision (FASTER)
average precision       4.33e-005
worst precision         6.80e-005

Test Function: trigSSE Cos ========================================

Cycles for 100 Calcs    400

Precision (FASTER)
average precision       4.33e-005
worst precision         6.80e-005


Gunther
You have to know the facts before you can distort them.

rrr314159

Here is the "final" SSE SIMD Sin/Cos algo, which does double (two at once) or single (four at once), selected by the flag "_USEDOUBLE" in trig.asm. There's only one algo, written with macros which switch from double to single according to the flag. For instance the statement "rmul xmm0, xmm1" can mean either mulpd or mulps. The macros are defined in "SSE48.inc":

IF _USEDOUBLE
    WORD_SIZE = 8
ELSE
    WORD_SIZE = 4
ENDIF
WORD_SIZE$ textequ %WORD_SIZE
CHUNK = 16/WORD_SIZE                ; (SSE, no AVX) CHUNK = 2 double, 4 single
%rtype textequ <real>,<&WORD_SIZE$> ; real8 or real4
%rptr textequ <&rtype>,< ptr>     ; use rptr only for vals which change
%tempreal textequ <temp>,<&WORD_SIZE$>
%absmask textequ <absmask>,<&WORD_SIZE$>

scratchxmm equ xmm3

;===============================
IF _USEDOUBLE    ;; for real8 xmm double prec
;===============================
    rmov equ movapd
    radd equ addpd
    rsub equ subpd
    rmul equ mulpd
    rand equ andpd
    rmax equ maxpd
    rhmax MACRO thexmm
        pshufd scratchxmm, thexmm, 8dh
        rmax thexmm, scratchxmm
    ENDM
    rhadd equ haddpd
    rmovs8 equ movsd ;; scalar move into real8
    rcvtr2i equ cvtpd2dq
    rcvti2r equ cvtdq2pd
    rxor equ xorpd
    rcvt2negmask MACRO thexmm ;; convert xmm1 int's to neg masks
        pslld thexmm, 31
        pshufd thexmm, thexmm, 073h
    ENDM
;===============================
ELSE ;; for real4 xmm single prec
;===============================
    rmov equ movaps
    radd equ addps
    rsub equ subps
    rmul equ mulps
    rand equ andps
    rmax equ maxps
    rhmax MACRO thexmm
        movhlps scratchxmm, thexmm
        maxps thexmm, scratchxmm
        pshufd scratchxmm, thexmm, 55h
        maxps thexmm, scratchxmm
    ENDM
    rhadd MACRO thexmm, thexmm
        haddps thexmm, thexmm
        haddps thexmm, thexmm
    ENDM
    rmovs8 MACRO thereal, thexmm ;; scalar move low float into real8
        movss temp4, thexmm
        fld temp4 ;; must go to this trouble apparently
        fstp thereal
    ENDM
    rcvtr2i equ cvtps2dq
    rcvti2r equ cvtdq2ps
    rxor equ xorps
    rcvt2negmask MACRO thexmm ;; convert xmm1 int's to neg masks
        pslld thexmm, 31
    ENDM
;===============================
ENDIF
;===============================


Obviously this technique can be applied to any routine, not just sin/cos, and also can be extended to include AVX.
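The idea is just a lookup from generic names to concrete mnemonics, chosen once by a flag. A rough Python model of that selection, covering only the simple one-to-one aliases from the listing above (the multi-instruction macros such as rhmax are left out); the function name and the dictionary layout are invented for illustration:

```python
def sse48_aliases(use_double):
    """Model the _USEDOUBLE switch: pick word size, chunk count and mnemonics."""
    word_size = 8 if use_double else 4
    suffix = "pd" if use_double else "ps"
    # Generic name -> mnemonic stem; the suffix selects packed double or single.
    stems = {"rmov": "mova", "radd": "add", "rsub": "sub", "rmul": "mul",
             "rand": "and", "rmax": "max", "rxor": "xor"}
    aliases = {name: stem + suffix for name, stem in stems.items()}
    aliases["rcvtr2i"] = "cvtpd2dq" if use_double else "cvtps2dq"
    aliases["rcvti2r"] = "cvtdq2pd" if use_double else "cvtdq2ps"
    return {
        "WORD_SIZE": word_size,
        "CHUNK": 16 // word_size,       # values per 16-byte xmm register
        "rtype": f"real{word_size}",    # real8 or real4
        "aliases": aliases,
    }

for flag in (True, False):
    cfg = sse48_aliases(flag)
    print(cfg["rtype"], cfg["CHUNK"], cfg["aliases"]["rmul"])
```

Extending it to AVX would mean a third setting with 32-byte registers and the v-prefixed mnemonics, which is the extension the post has in mind.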

The timing with single precision is under a nanosecond per calc. I actually got one run with less than 1 cycle per sin (!): the FASTER precision (about 5 digits) did 100 in 99 cycles on the Intel:

Intel i5 3330 2.94 GHz

Sleeping to stabilize timings...-----------------------------------
REAL4 sin/cos SSE algorithm
BLANK LOOP total is     119

Test Function: trigSSE Sin ========================================

Cycles for 100 Calcs    163

Precision (HIGHER)
average precision       3.81e-007
worst precision         8.79e-007

Test Function: trigSSE Cos ========================================

Cycles for 100 Calcs    190

Precision (HIGHER)
average precision       4.01e-007
worst precision         8.05e-007

Test Function: trigSSE Sin ========================================

Cycles for 100 Calcs    99

Precision (FASTER)
average precision       4.38e-005
worst precision         6.79e-005

Test Function: trigSSE Cos ========================================

Cycles for 100 Calcs    121

Precision (FASTER)
average precision       4.38e-005
worst precision         6.76e-005

===================================================================
AMD A6 1.8 GHz

Sleeping to stabilize timings...-----------------------------------
REAL4 sin/cos SSE algorithm
BLANK LOOP total is     84

Test Function: trigSSE Sin ========================================

Cycles for 100 Calcs    316

Precision (HIGHER)
average precision       3.81e-007
worst precision         8.79e-007

Test Function: trigSSE Cos ========================================

Cycles for 100 Calcs    380

Precision (HIGHER)
average precision       4.01e-007
worst precision         8.05e-007

Test Function: trigSSE Sin ========================================

Cycles for 100 Calcs    293

Precision (FASTER)
average precision       4.38e-005
worst precision         6.79e-005

Test Function: trigSSE Cos ========================================

Cycles for 100 Calcs    336

Precision (FASTER)
average precision       4.38e-005
worst precision         6.76e-005


... but exactly what this means is not clear, since it depends so much on the blank loop timing. Usually it's more like 136 cycles or so, and the AMD is close to 300. Anyway, it's pretty fast.
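The cycles-to-nanoseconds conversion behind the "under a nanosecond" remark is simple arithmetic. A small Python sketch, assuming the clock speed listed for the i5 (2.94 GHz) and taking "Cycles for 100 Calcs" at face value; how the blank loop total enters the reported figure is not stated here, so the sketch does not touch it. The function name is hypothetical.

```python
def ns_per_calc(cycles_for_batch, clock_ghz, calcs=100):
    """Convert a batch cycle count to nanoseconds per calculation."""
    cycles_each = cycles_for_batch / calcs
    return cycles_each / clock_ghz      # cycles / (cycles per ns) = ns

# REAL4 figures posted for the Intel i5 3330 at 2.94 GHz.
i5_runs = {"Sin HIGHER": 163, "Cos HIGHER": 190, "Sin FASTER": 99, "Cos FASTER": 121}
for name, cycles in i5_runs.items():
    print(f"{name}: {cycles / 100:.2f} cycles, {ns_per_calc(cycles, 2.94):.3f} ns")
```

At the more typical 136 cycles per 100 mentioned above, the same arithmetic still comes out below half a nanosecond per calculation on that machine.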

To-do list includes:

- sharpen the coefficients
- AVX version
- Cos Power Series

This basically accomplishes what I wanted from this thread, and I've had enough of sinus problems for the time being, so the to-do list will be put off indefinitely. Thanks to all, especially siekmanski for the LUT, qWord for the Muller algo, and jj2007 for the Chebyshev suggestion. Of course any further comments, timings, bug reports, kudos or complaints are welcome. Well, maybe not complaints.

The zip contains

trig_macros.asm: the SSE single/double routine
trig.asm: controller
test_support_macros.asm: the hard part
SSE48.inc: includes single/double macros
trig.exe
dotrig.bat: "MakeFile" batch file


TWell

AMD E1-6010 1,35 GHz
Sleeping to stabilize timings...------------------------------------
REAL4 sin/cos SSE algorithm
BLANK LOOP total is 80

Test Function: trigSSE Sin ========================================

Cycles for 100 Calcs 427

Precision (HIGHER)
average precision 3.81e-007
worst precision    8.79e-007

Test Function: trigSSE Cos ========================================

Cycles for 100 Calcs 465

Precision (HIGHER)
average precision 4.01e-007
worst precision    8.05e-007

Test Function: trigSSE Sin ========================================

Cycles for 100 Calcs 369

Precision (FASTER)
average precision 4.38e-005
worst precision    6.79e-005

Test Function: trigSSE Cos ========================================

Cycles for 100 Calcs 429

Precision (FASTER)
average precision 4.38e-005
worst precision    6.76e-005

Siekmanski

Intel i7-4930K @3.4GHz

Sleeping to stabilize timings...------------------------------------
REAL4 sin/cos SSE algorithm
BLANK LOOP total is     140

Test Function: trigSSE Sin ========================================

Cycles for 100 Calcs    190

Precision (HIGHER)
average precision       3.81e-007
worst precision         8.79e-007

Test Function: trigSSE Cos ========================================

Cycles for 100 Calcs    216

Precision (HIGHER)
average precision       4.01e-007
worst precision         8.05e-007

Test Function: trigSSE Sin ========================================

Cycles for 100 Calcs    117

Precision (FASTER)
average precision       4.38e-005
worst precision         6.79e-005

Test Function: trigSSE Cos ========================================

Cycles for 100 Calcs    201

Precision (FASTER)
average precision       4.38e-005
worst precision         6.76e-005

dedndave

P4 prescott w/htt
Sleeping to stabilize timings...------------------------------------
REAL4 sin/cos SSE algorithm
BLANK LOOP total is     135

Test Function: trigSSE Sin ========================================

Cycles for 100 Calcs    902

Precision (HIGHER)
average precision       3.81e-007
worst precision         8.79e-007

Test Function: trigSSE Cos ========================================

Cycles for 100 Calcs    977

Precision (HIGHER)
average precision       4.01e-007
worst precision         8.05e-007

Test Function: trigSSE Sin ========================================

Cycles for 100 Calcs    737

Precision (FASTER)
average precision       4.38e-005
worst precision         6.79e-005

Test Function: trigSSE Cos ========================================

Cycles for 100 Calcs    839

Precision (FASTER)
average precision       4.38e-005
worst precision         6.76e-005