News:

Masm32 SDK description, downloads and other helpful links
Message to All Guests

Main Menu

1/x timings for FPU and SIMD code

Started by jj2007, June 23, 2018, 05:22:40 AM

Previous topic - Next topic

jj2007

Normally, the FPU is not slower than equivalent SIMD code, but 1/x beats it:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

3168    cycles for 1000 * rcpss
13049   cycles for 1000 * fdiv

3167    cycles for 1000 * rcpss
13135   cycles for 1000 * fdiv

3181    cycles for 1000 * rcpss
13259   cycles for 1000 * fdiv

3189    cycles for 1000 * rcpss
13092   cycles for 1000 * fdiv

3156    cycles for 1000 * rcpss
13070   cycles for 1000 * fdiv

24      bytes for rcpss
23      bytes for fdiv

ST0     123453440.0000000000
ST0     123456792.0000000000


Of course, precision is lower; the expected value is 123456789.0

The source:
NameA equ rcpss ; assign a descriptive name here
TestA proc
  mov ebx, AlgoLoops-1 ; loop 1000x
  push 123456789
  fild stack
  fstp stack
  pop eax
  movd xmm0, eax
  align 4
  .Repeat
rcpss xmm0, xmm0
dec ebx
  .Until Sign?
  movd eax, xmm0
  ret
TestA endp

align_64
NameB equ fdiv ; assign a descriptive name here
TestB proc
  mov ebx, AlgoLoops-1 ; loop 1000x
  push 123456789
  fild stack
  fstp stack
  fld1
  align 4
  .Repeat
fld stack
fdiv ST, ST(1)
fstp stack
dec ebx
  .Until Sign?
  fstp st
  pop eax
  ret
TestB endp

Siekmanski

Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)

3868    cycles for 1000 * rcpss
15813   cycles for 1000 * fdiv

3868    cycles for 1000 * rcpss
15826   cycles for 1000 * fdiv

3874    cycles for 1000 * rcpss
15808   cycles for 1000 * fdiv

3875    cycles for 1000 * rcpss
15808   cycles for 1000 * fdiv

3868    cycles for 1000 * rcpss
15821   cycles for 1000 * fdiv

24      bytes for rcpss
23      bytes for fdiv

ST0     123453440.0000000000
ST0     123456792.0000000000
Creative coders use backward thinking techniques as a strategy.

zedd151


AMD A6-9220e RADEON R4, 5 COMPUTE CORES 2C+3G   (SSE4)

2564    cycles for 1000 * rcpss
17517   cycles for 1000 * fdiv

2552    cycles for 1000 * rcpss
16796   cycles for 1000 * fdiv

2822    cycles for 1000 * rcpss
16735   cycles for 1000 * fdiv

2566    cycles for 1000 * rcpss
16124   cycles for 1000 * fdiv

2636    cycles for 1000 * rcpss
16808   cycles for 1000 * fdiv

24      bytes for rcpss
23      bytes for fdiv

ST0     123453440.0000000000
ST0     123456792.0000000000

--- ok ---


1.60 Ghz as usual

Yuri


Intel(R) Core(TM) i3 CPU         540  @ 3.07GHz (SSE4)

928     cycles for 1000 * rcpss
12260   cycles for 1000 * fdiv

903     cycles for 1000 * rcpss
12142   cycles for 1000 * fdiv

889     cycles for 1000 * rcpss
12072   cycles for 1000 * fdiv

893     cycles for 1000 * rcpss
12114   cycles for 1000 * fdiv

894     cycles for 1000 * rcpss
12059   cycles for 1000 * fdiv

24      bytes for rcpss
23      bytes for fdiv

ST0     123453440.0000000000
ST0     123456792.0000000000

--- ok ---

jimg

It shouldn't be a surprise since you're executing three instructions each loop for the fpu vs. one instruction for simd

Mikl__

Hi,jj2007!
fild stack
at first I was even delighted with the non-standard appeal to the top of the FPU, but
tut_02.asm(8) : error A2006:undefined symbol : stack

jj2007

signed equ sdword ptr
stack equ <DWord Ptr [esp]>
push 9  ; 10 iterations
.Repeat
  ... do something ...
  dec stack
.Until Sign? || signed eax<0
pop edx


@JimG: The fdiv makes it slow. It seems rcpss uses a much faster algorithm.

Mikl__

Hi, jj2007!
stack equ <DWord Ptr [esp]>
Mille grazie!

daydreamer

Why dont we compare against rcpps and divps, divpd?, should be more useful results i think
my none asm creations
https://masm32.com/board/index.php?topic=6937.msg74303#msg74303
I am an Invoker
"An Invoker is a mage who specializes in the manipulation of raw and elemental energies."
Like SIMD coding

jj2007

Quote from: daydreamer on June 24, 2018, 08:41:32 PM
Why dont we compare against rcpps and divps, divpd?, should be more useful results i think

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

3238    cycles for 1000 * rcpss
13094   cycles for 1000 * 1/x using fdiv
10453   cycles for 1000 * 1/x using divss

3157    cycles for 1000 * rcpss
13125   cycles for 1000 * 1/x using fdiv
10602   cycles for 1000 * 1/x using divss

3151    cycles for 1000 * rcpss
13064   cycles for 1000 * 1/x using fdiv
10644   cycles for 1000 * 1/x using divss

24      bytes for rcpss
23      bytes for 1/x using fdiv
39      bytes for 1/x using divss

ST0     123453440.0000000000
ST0     123456792.0000000000
ST0     123453440.0000000000

  movd xmm2, FP4(1.0)
  align 4
  .Repeat
movaps xmm1, xmm2 ; reload 1.0
divss xmm1, xmm0 ; divide by 123456789
dec ebx
  .Until Sign?

Siekmanski

Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)

3864    cycles for 1000 * rcpss
15808   cycles for 1000 * 1/x using fdiv
6211    cycles for 1000 * 1/x using divss

3865    cycles for 1000 * rcpss
15789   cycles for 1000 * 1/x using fdiv
6268    cycles for 1000 * 1/x using divss

3866    cycles for 1000 * rcpss
15797   cycles for 1000 * 1/x using fdiv
6190    cycles for 1000 * 1/x using divss

24      bytes for rcpss
23      bytes for 1/x using fdiv
39      bytes for 1/x using divss

ST0     123453440.0000000000
ST0     123456792.0000000000
ST0     123453440.0000000000
Creative coders use backward thinking techniques as a strategy.

mineiro

Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz (SSE4)

2584    cycles for 1000 * rcpss
16274   cycles for 1000 * 1/x using fdiv
1818    cycles for 1000 * 1/x using divss

2597    cycles for 1000 * rcpss
16282   cycles for 1000 * 1/x using fdiv
1818    cycles for 1000 * 1/x using divss

2587    cycles for 1000 * rcpss
16274   cycles for 1000 * 1/x using fdiv
1818    cycles for 1000 * 1/x using divss

24      bytes for rcpss
23      bytes for 1/x using fdiv
39      bytes for 1/x using divss

ST0     123453440.0000000000
ST0     123456792.0000000000
ST0     123453440.0000000000

--- ok ---
I'd rather be this ambulant metamorphosis than to have that old opinion about everything

zedd151

1.60 Ghz - per usual

AMD A6-9220e RADEON R4, 5 COMPUTE CORES 2C+3G   (SSE4)

2353    cycles for 1000 * rcpss
15288   cycles for 1000 * 1/x using fdiv
2070    cycles for 1000 * 1/x using divss

2352    cycles for 1000 * rcpss
15347   cycles for 1000 * 1/x using fdiv
2071    cycles for 1000 * 1/x using divss

2353    cycles for 1000 * rcpss
15283   cycles for 1000 * 1/x using fdiv
2069    cycles for 1000 * 1/x using divss

24      bytes for rcpss
23      bytes for 1/x using fdiv
39      bytes for 1/x using divss

ST0     123453440.0000000000
ST0     123456792.0000000000
ST0     123453440.0000000000
--- ok ---


Quote from: daydreamer on June 24, 2018, 08:41:32 PM
Why dont we...

Why don't you ...     ...        ...       ...     ...       post your results?    :P

The more, the merrier.   :biggrin:

Yuri


Intel(R) Core(TM) i3 CPU         540  @ 3.07GHz (SSE4)

910     cycles for 1000 * rcpss
12488   cycles for 1000 * 1/x using fdiv
12446   cycles for 1000 * 1/x using divss

880     cycles for 1000 * rcpss
12217   cycles for 1000 * 1/x using fdiv
12854   cycles for 1000 * 1/x using divss

1029    cycles for 1000 * rcpss
12393   cycles for 1000 * 1/x using fdiv
12029   cycles for 1000 * 1/x using divss

24      bytes for rcpss
23      bytes for 1/x using fdiv
39      bytes for 1/x using divss

ST0     123453440.0000000000
ST0     123456792.0000000000
ST0     123453440.0000000000

--- ok ---

zedd151

These results are widely varying.

fdiv is definitely out, but it's a tossup between rcpss and divss.