Topic: 1/x timings for FPU and SIMD code

jj2007

  • Member
  • *****
  • Posts: 8841
  • Assembler is fun ;-)
    • MasmBasic
1/x timings for FPU and SIMD code
« on: June 23, 2018, 05:22:40 AM »
Normally, the FPU is not slower than equivalent SIMD code, but for 1/x, rcpss clearly beats fdiv:
Code: [Select]
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

3168    cycles for 1000 * rcpss
13049   cycles for 1000 * fdiv

3167    cycles for 1000 * rcpss
13135   cycles for 1000 * fdiv

3181    cycles for 1000 * rcpss
13259   cycles for 1000 * fdiv

3189    cycles for 1000 * rcpss
13092   cycles for 1000 * fdiv

3156    cycles for 1000 * rcpss
13070   cycles for 1000 * fdiv

24      bytes for rcpss
23      bytes for fdiv

ST0     123453440.0000000000
ST0     123456792.0000000000

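Of course, precision is lower: rcpss only returns an approximation (Intel documents a relative error of at most 1.5 * 2^-12). The expected value is 123456789.0, since an even number of reciprocals should bring you back to the start value.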
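To put numbers on the precision remark, here is a small C sketch (not part of the original test; the function names are made up for illustration). It assumes the first ST0 line belongs to rcpss and the second to fdiv, which is presumably the case given the order of the tests. It shows that 123456792 is simply 123456789 rounded to single precision, and how far the rcpss result is from the exact value.

```c
#include <assert.h>
#include <stdio.h>

/* Round a 32-bit integer to single precision (REAL4), which is what the
   fild/fstp pair on a DWORD does in the test code. */
float to_single(int v)
{
    volatile float f = (float)v;   /* volatile forces a real 32-bit store */
    return f;
}

/* Relative error of an approximation against a non-zero exact value. */
double rel_error(double approx, double exact)
{
    assert(exact != 0.0);
    double d = (approx - exact) / exact;
    return d < 0.0 ? -d : d;
}

/* Print the comparison for the numbers posted above. */
void report(void)
{
    const int x = 123456789;
    printf("x as REAL4      : %.1f\n", to_single(x));
    printf("rcpss rel. error: %.3g\n", rel_error(123453440.0, x));
    printf("documented bound: %.3g\n", 1.5 / 4096.0);   /* 1.5 * 2^-12 */
}
```

Between 2^26 and 2^27 single-precision values are 8 apart, so to_single(123456789) gives 123456792.0. The rcpss value is off by 3349, a relative error of about 2.7e-5, well inside the documented bound of roughly 3.7e-4.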

The source:
Code: [Select]
NameA equ rcpss ; assign a descriptive name here
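TestA proc
  mov ebx, AlgoLoops-1 ; loop 1000x
  push 123456789
  fild stack ; load the integer from [esp] ...
  fstp stack ; ... and store it back as a REAL4
  pop eax
  movd xmm0, eax ; xmm0 = 123456789.0 (single precision)
  align 4
  .Repeat
    rcpss xmm0, xmm0 ; xmm0 = approx. 1/xmm0, fed into the next iteration
    dec ebx
  .Until Sign?
  movd eax, xmm0
  ret
TestA endp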

align_64
NameB equ fdiv ; assign a descriptive name here
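TestB proc
  mov ebx, AlgoLoops-1 ; loop 1000x
  push 123456789
  fild stack
  fstp stack ; [esp] = 123456789.0 as a REAL4
  fld1 ; ST(0) = 1.0
  align 4
  .Repeat
    fld stack ; ST(0) = x, ST(1) = 1.0
    fdiv ST, ST(1) ; ST(0) = ST(0) / ST(1) = x / 1.0 (1.0 / x would be fdivr)
    fstp stack
    dec ebx
  .Until Sign?
  fstp st ; pop the 1.0
  pop eax
  ret
TestB endp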

Siekmanski

  • Member
  • *****
  • Posts: 1684
Re: 1/x timings for FPU and SIMD code
« Reply #1 on: June 23, 2018, 05:58:07 AM »
Code: [Select]
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)

3868    cycles for 1000 * rcpss
15813   cycles for 1000 * fdiv

3868    cycles for 1000 * rcpss
15826   cycles for 1000 * fdiv

3874    cycles for 1000 * rcpss
15808   cycles for 1000 * fdiv

3875    cycles for 1000 * rcpss
15808   cycles for 1000 * fdiv

3868    cycles for 1000 * rcpss
15821   cycles for 1000 * fdiv

24      bytes for rcpss
23      bytes for fdiv

ST0     123453440.0000000000
ST0     123456792.0000000000
Creative coders use backward thinking techniques as a strategy.

zedd151

  • Member
  • ****
  • Posts: 850
Re: 1/x timings for FPU and SIMD code
« Reply #2 on: June 23, 2018, 06:09:51 AM »
Code: [Select]
AMD A6-9220e RADEON R4, 5 COMPUTE CORES 2C+3G   (SSE4)

2564    cycles for 1000 * rcpss
17517   cycles for 1000 * fdiv

2552    cycles for 1000 * rcpss
16796   cycles for 1000 * fdiv

2822    cycles for 1000 * rcpss
16735   cycles for 1000 * fdiv

2566    cycles for 1000 * rcpss
16124   cycles for 1000 * fdiv

2636    cycles for 1000 * rcpss
16808   cycles for 1000 * fdiv

24      bytes for rcpss
23      bytes for fdiv

ST0     123453440.0000000000
ST0     123456792.0000000000

--- ok ---

1.60 GHz, as usual
I'm not always the sharpest knife in the drawer, but I have my moments.  :P

Yuri

  • Member
  • **
  • Posts: 173
Re: 1/x timings for FPU and SIMD code
« Reply #3 on: June 23, 2018, 02:29:40 PM »
Code: [Select]
Intel(R) Core(TM) i3 CPU         540  @ 3.07GHz (SSE4)

928     cycles for 1000 * rcpss
12260   cycles for 1000 * fdiv

903     cycles for 1000 * rcpss
12142   cycles for 1000 * fdiv

889     cycles for 1000 * rcpss
12072   cycles for 1000 * fdiv

893     cycles for 1000 * rcpss
12114   cycles for 1000 * fdiv

894     cycles for 1000 * rcpss
12059   cycles for 1000 * fdiv

24      bytes for rcpss
23      bytes for fdiv

ST0     123453440.0000000000
ST0     123456792.0000000000

--- ok ---

jimg

  • Member
  • ***
  • Posts: 281
Re: 1/x timings for FPU and SIMD code
« Reply #4 on: June 23, 2018, 03:21:00 PM »
That shouldn't be a surprise, since you're executing three instructions per loop iteration for the FPU vs. one instruction for SIMD.

Mikl__

  • Member
  • ****
  • Posts: 702
Re: 1/x timings for FPU and SIMD code
« Reply #5 on: June 24, 2018, 10:59:23 AM »
Hi,jj2007!
Code: [Select]
fild stack
At first I was even delighted by this non-standard way of addressing the top of the FPU stack, but
Code: [Select]
tut_02.asm(8) : error A2006:undefined symbol : stack

jj2007

  • Member
  • *****
  • Posts: 8841
  • Assembler is fun ;-)
    • MasmBasic
Re: 1/x timings for FPU and SIMD code
« Reply #6 on: June 24, 2018, 02:59:14 PM »
Code: [Select]
signed equ sdword ptr
stack equ <DWord Ptr [esp]>
push 9  ; 10 iterations
.Repeat
  ... do something ...
  dec stack
.Until Sign? || signed eax<0
pop edx

@jimg: It's the fdiv that makes it slow. It seems rcpss uses a much faster algorithm, at the price of lower precision.

Mikl__

  • Member
  • ****
  • Posts: 702
Re: 1/x timings for FPU and SIMD code
« Reply #7 on: June 24, 2018, 03:42:00 PM »
Hi, jj2007!
Code: [Select]
stack equ <DWord Ptr [esp]>
Mille grazie!

daydreamer

  • Member
  • ****
  • Posts: 557
  • reach for the stars
Re: 1/x timings for FPU and SIMD code
« Reply #8 on: June 24, 2018, 08:41:32 PM »
Why don't we compare against rcpps and divps, or divpd? That should give more useful results, I think.
Quote from Flashdance
Nick  :  When you give up your dream, you die.
*wears a flameproof asbestos suit*

jj2007

  • Member
  • *****
  • Posts: 8841
  • Assembler is fun ;-)
    • MasmBasic
Re: 1/x timings for FPU and SIMD code
« Reply #9 on: June 24, 2018, 11:54:42 PM »
Quote from: daydreamer on June 24, 2018, 08:41:32 PM
Why don't we compare against rcpps and divps, or divpd? That should give more useful results, I think.

Code: [Select]
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

3238    cycles for 1000 * rcpss
13094   cycles for 1000 * 1/x using fdiv
10453   cycles for 1000 * 1/x using divss

3157    cycles for 1000 * rcpss
13125   cycles for 1000 * 1/x using fdiv
10602   cycles for 1000 * 1/x using divss

3151    cycles for 1000 * rcpss
13064   cycles for 1000 * 1/x using fdiv
10644   cycles for 1000 * 1/x using divss

24      bytes for rcpss
23      bytes for 1/x using fdiv
39      bytes for 1/x using divss

ST0     123453440.0000000000
ST0     123456792.0000000000
ST0     123453440.0000000000
Code: [Select]
  movd xmm2, FP4(1.0)
  align 4
  .Repeat
movaps xmm1, xmm2 ; reload 1.0
divss xmm1, xmm0 ; divide by 123456789
dec ebx
  .Until Sign?
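
For anyone who wants to compare the precision of the two instructions outside the timing harness, here is a minimal C sketch using the SSE intrinsics that map to rcpss and divss. It is not the code that produced the timings above; the function names are made up, and the plain-division fallback is only there so the sketch still builds on compilers without SSE.

```c
#include <assert.h>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define HAVE_SSE 1
#else
#define HAVE_SSE 0
#endif

/* Approximate 1/x: rcpss where SSE is available. */
float recip_rcpss(float x)
{
#if HAVE_SSE
    return _mm_cvtss_f32(_mm_rcp_ss(_mm_set_ss(x)));
#else
    return 1.0f / x;   /* fallback only, not an approximation */
#endif
}

/* Correctly rounded 1/x: divss where SSE is available. */
float recip_divss(float x)
{
#if HAVE_SSE
    return _mm_cvtss_f32(_mm_div_ss(_mm_set_ss(1.0f), _mm_set_ss(x)));
#else
    return 1.0f / x;
#endif
}

/* Relative difference between the rcpss and the divss result. */
double recip_rel_diff(float x)
{
    assert(x != 0.0f);
    double exact = recip_divss(x);
    double d = (recip_rcpss(x) - exact) / exact;
    return d < 0.0 ? -d : d;
}
```

For any normal input, recip_rel_diff should stay at or below roughly 1.5 * 2^-12 (about 3.7e-4), the error bound Intel documents for rcpss. The exact rcpss bits may differ between CPU vendors.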

Siekmanski

  • Member
  • *****
  • Posts: 1684
Re: 1/x timings for FPU and SIMD code
« Reply #10 on: June 25, 2018, 12:28:25 AM »
Code: [Select]
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)

3864    cycles for 1000 * rcpss
15808   cycles for 1000 * 1/x using fdiv
6211    cycles for 1000 * 1/x using divss

3865    cycles for 1000 * rcpss
15789   cycles for 1000 * 1/x using fdiv
6268    cycles for 1000 * 1/x using divss

3866    cycles for 1000 * rcpss
15797   cycles for 1000 * 1/x using fdiv
6190    cycles for 1000 * 1/x using divss

24      bytes for rcpss
23      bytes for 1/x using fdiv
39      bytes for 1/x using divss

ST0     123453440.0000000000
ST0     123456792.0000000000
ST0     123453440.0000000000
Creative coders use backward thinking techniques as a strategy.

mineiro

  • Member
  • ***
  • Posts: 450
Re: 1/x timings for FPU and SIMD code
« Reply #11 on: June 25, 2018, 02:07:24 AM »
Code: [Select]
Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz (SSE4)

2584    cycles for 1000 * rcpss
16274   cycles for 1000 * 1/x using fdiv
1818    cycles for 1000 * 1/x using divss

2597    cycles for 1000 * rcpss
16282   cycles for 1000 * 1/x using fdiv
1818    cycles for 1000 * 1/x using divss

2587    cycles for 1000 * rcpss
16274   cycles for 1000 * 1/x using fdiv
1818    cycles for 1000 * 1/x using divss

24      bytes for rcpss
23      bytes for 1/x using fdiv
39      bytes for 1/x using divss

ST0     123453440.0000000000
ST0     123456792.0000000000
ST0     123453440.0000000000

--- ok ---
I'd rather be this ambulant metamorphosis than to have that old opinion about everything

zedd151

  • Member
  • ****
  • Posts: 850
Re: 1/x timings for FPU and SIMD code
« Reply #12 on: June 25, 2018, 02:13:21 AM »
1.60 GHz, as usual
Code: [Select]
AMD A6-9220e RADEON R4, 5 COMPUTE CORES 2C+3G   (SSE4)
 
2353    cycles for 1000 * rcpss
15288   cycles for 1000 * 1/x using fdiv
2070    cycles for 1000 * 1/x using divss
 
2352    cycles for 1000 * rcpss
15347   cycles for 1000 * 1/x using fdiv
2071    cycles for 1000 * 1/x using divss
 
2353    cycles for 1000 * rcpss
15283   cycles for 1000 * 1/x using fdiv
2069    cycles for 1000 * 1/x using divss
 
24      bytes for rcpss
23      bytes for 1/x using fdiv
39      bytes for 1/x using divss
 
ST0     123453440.0000000000
ST0     123456792.0000000000
ST0     123453440.0000000000
--- ok ---

Quote from: daydreamer
Why don't we...

Why don't you ... post your results?    :P
 
The more, the merrier.   :biggrin:
I'm not always the sharpest knife in the drawer, but I have my moments.  :P

Yuri

  • Member
  • **
  • Posts: 173
Re: 1/x timings for FPU and SIMD code
« Reply #13 on: June 25, 2018, 03:02:08 AM »
Code: [Select]
Intel(R) Core(TM) i3 CPU         540  @ 3.07GHz (SSE4)

910     cycles for 1000 * rcpss
12488   cycles for 1000 * 1/x using fdiv
12446   cycles for 1000 * 1/x using divss

880     cycles for 1000 * rcpss
12217   cycles for 1000 * 1/x using fdiv
12854   cycles for 1000 * 1/x using divss

1029    cycles for 1000 * rcpss
12393   cycles for 1000 * 1/x using fdiv
12029   cycles for 1000 * 1/x using divss

24      bytes for rcpss
23      bytes for 1/x using fdiv
39      bytes for 1/x using divss

ST0     123453440.0000000000
ST0     123456792.0000000000
ST0     123453440.0000000000

--- ok ---

zedd151

  • Member
  • ****
  • Posts: 850
Re: 1/x timings for FPU and SIMD code
« Reply #14 on: June 25, 2018, 03:05:28 AM »
These results vary widely.

fdiv is definitely out, but it's a toss-up between rcpss and divss.
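
To make the toss-up a bit more concrete, here is a short C sketch that puts the first run of each machine from the second round into a table and computes the divss/rcpss ratio. The numbers are copied from the posts above; the struct and function names are just for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Cycles for 1000 iterations, first run of each machine (second round). */
struct timing { const char *cpu; int rcpss, fdiv, divss; };

static const struct timing runs[] = {
    { "i5-2450M", 3238, 13094, 10453 },
    { "i7-4930K", 3864, 15808,  6211 },
    { "i7-6700",  2584, 16274,  1818 },
    { "A6-9220e", 2353, 15288,  2070 },
    { "i3 540",    910, 12488, 12446 },
};
#define RUN_COUNT (sizeof runs / sizeof runs[0])

/* Ratio above 1.0 means rcpss was faster, below 1.0 means divss was faster. */
double divss_over_rcpss(size_t i)
{
    assert(i < RUN_COUNT);
    return (double)runs[i].divss / runs[i].rcpss;
}

/* Print one line per machine with the raw cycles and the ratio. */
void print_table(void)
{
    for (size_t i = 0; i < RUN_COUNT; i++)
        printf("%-10s %6d %6d %6d   divss/rcpss = %.2f\n",
               runs[i].cpu, runs[i].rcpss, runs[i].fdiv, runs[i].divss,
               divss_over_rcpss(i));
}
```

The ratio runs from about 0.7 on the i7-6700 to more than 13 on the i3 540, so which instruction wins depends heavily on the CPU.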

I'm not always the sharpest knife in the drawer, but I have my moments.  :P