The MASM Forum

General => The Laboratory => Topic started by: jj2007 on June 23, 2018, 05:22:40 AM

Title: 1/x timings for FPU and SIMD code
Post by: jj2007 on June 23, 2018, 05:22:40 AM
Normally, the FPU is not slower than equivalent SIMD code, but for 1/x the SIMD instruction rcpss clearly beats fdiv:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

3168    cycles for 1000 * rcpss
13049   cycles for 1000 * fdiv

3167    cycles for 1000 * rcpss
13135   cycles for 1000 * fdiv

3181    cycles for 1000 * rcpss
13259   cycles for 1000 * fdiv

3189    cycles for 1000 * rcpss
13092   cycles for 1000 * fdiv

3156    cycles for 1000 * rcpss
13070   cycles for 1000 * fdiv

24      bytes for rcpss
23      bytes for fdiv

ST0     123453440.0000000000
ST0     123456792.0000000000


Of course, the rcpss result (first ST0 line) is less precise; the expected value is 123456789.0
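
For readers who want to check the two ST0 values: 123456792 is simply the single-precision value nearest to 123456789, and the rcpss figure lies well inside the relative error bound of 1.5 * 2^-12 that Intel documents for that instruction. A small Python sketch, used here only as a calculator (the helper name to_float32 is made up for this illustration):

```python
import struct

def to_float32(value):
    """Round a Python float to the nearest IEEE-754 single and return it as a float."""
    return struct.unpack("<f", struct.pack("<f", value))[0]

expected = 123456789.0
as_single = to_float32(expected)     # what storing the value as a REAL4 leaves behind
rcpss_result = 123453440.0           # first ST0 value printed by the test above

rel_err = abs(rcpss_result - expected) / expected
rcpss_bound = 1.5 * 2.0 ** -12       # documented bound for a single rcpss

print("nearest single:", as_single)
print("relative error of rcpss result:", rel_err)
print("documented bound for one rcpss:", rcpss_bound)
```

The bound applies to one rcpss, while the test value went through 1000 of them, so this is only a rough sanity check.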

The source:
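NameA equ rcpss ; assign a descriptive name here
TestA proc
  mov ebx, AlgoLoops-1 ; loop 1000x
  push 123456789       ; integer test value
  fild stack           ; load it as an integer ...
  fstp stack           ; ... and store it back as a REAL4
  pop eax
  movd xmm0, eax       ; xmm0 = 123456789 as a single-precision float
  align 4
  .Repeat
    rcpss xmm0, xmm0   ; xmm0 = approx. 1/xmm0; the result feeds the next iteration
    dec ebx
  .Until Sign?
  movd eax, xmm0
  ret
TestA endp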

align_64
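NameB equ fdiv ; assign a descriptive name here
TestB proc
  mov ebx, AlgoLoops-1 ; loop 1000x
  push 123456789       ; integer test value
  fild stack           ; load it as an integer ...
  fstp stack           ; ... and store it back as a REAL4
  fld1                 ; ST(0) = 1.0
  align 4
  .Repeat
    fld stack          ; ST(0) = value, ST(1) = 1.0
    fdiv ST, ST(1)     ; ST(0) = ST(0)/ST(1) = value/1.0; 1/value would need fdivr
    fstp stack         ; store the result; the next fld reloads it from memory
    dec ebx
  .Until Sign?
  fstp st              ; pop the 1.0
  pop eax
  ret
TestB endp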
Title: Re: 1/x timings for FPU and SIMD code
Post by: Siekmanski on June 23, 2018, 05:58:07 AM
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)

3868    cycles for 1000 * rcpss
15813   cycles for 1000 * fdiv

3868    cycles for 1000 * rcpss
15826   cycles for 1000 * fdiv

3874    cycles for 1000 * rcpss
15808   cycles for 1000 * fdiv

3875    cycles for 1000 * rcpss
15808   cycles for 1000 * fdiv

3868    cycles for 1000 * rcpss
15821   cycles for 1000 * fdiv

24      bytes for rcpss
23      bytes for fdiv

ST0     123453440.0000000000
ST0     123456792.0000000000
Title: Re: 1/x timings for FPU and SIMD code
Post by: zedd151 on June 23, 2018, 06:09:51 AM

AMD A6-9220e RADEON R4, 5 COMPUTE CORES 2C+3G   (SSE4)

2564    cycles for 1000 * rcpss
17517   cycles for 1000 * fdiv

2552    cycles for 1000 * rcpss
16796   cycles for 1000 * fdiv

2822    cycles for 1000 * rcpss
16735   cycles for 1000 * fdiv

2566    cycles for 1000 * rcpss
16124   cycles for 1000 * fdiv

2636    cycles for 1000 * rcpss
16808   cycles for 1000 * fdiv

24      bytes for rcpss
23      bytes for fdiv

ST0     123453440.0000000000
ST0     123456792.0000000000

--- ok ---


1.60 GHz, as usual
Title: Re: 1/x timings for FPU and SIMD code
Post by: Yuri on June 23, 2018, 02:29:40 PM

Intel(R) Core(TM) i3 CPU         540  @ 3.07GHz (SSE4)

928     cycles for 1000 * rcpss
12260   cycles for 1000 * fdiv

903     cycles for 1000 * rcpss
12142   cycles for 1000 * fdiv

889     cycles for 1000 * rcpss
12072   cycles for 1000 * fdiv

893     cycles for 1000 * rcpss
12114   cycles for 1000 * fdiv

894     cycles for 1000 * rcpss
12059   cycles for 1000 * fdiv

24      bytes for rcpss
23      bytes for fdiv

ST0     123453440.0000000000
ST0     123456792.0000000000

--- ok ---
Title: Re: 1/x timings for FPU and SIMD code
Post by: jimg on June 23, 2018, 03:21:00 PM
It shouldn't be a surprise, since you're executing three instructions per loop for the FPU vs. one instruction for SIMD.
Title: Re: 1/x timings for FPU and SIMD code
Post by: Mikl__ on June 24, 2018, 10:59:23 AM
Hi, jj2007!
fild stack
At first I was even delighted by this non-standard way of addressing the top of the FPU stack, but the assembler says:
tut_02.asm(8) : error A2006:undefined symbol : stack
Title: Re: 1/x timings for FPU and SIMD code
Post by: jj2007 on June 24, 2018, 02:59:14 PM
signed equ sdword ptr
stack equ <DWord Ptr [esp]>
push 9  ; 10 iterations
.Repeat
  ... do something ...
  dec stack
.Until Sign? || signed eax<0
pop edx


@JimG: The fdiv makes it slow. It seems rcpss uses a much faster algorithm.
Title: Re: 1/x timings for FPU and SIMD code
Post by: Mikl__ on June 24, 2018, 03:42:00 PM
Hi, jj2007!
stack equ <DWord Ptr [esp]>
Mille grazie (many thanks)!
Title: Re: 1/x timings for FPU and SIMD code
Post by: daydreamer on June 24, 2018, 08:41:32 PM
Why don't we compare against rcpps, divps and divpd? That should give more useful results, I think.
Title: Re: 1/x timings for FPU and SIMD code
Post by: jj2007 on June 24, 2018, 11:54:42 PM
Quote from: daydreamer on June 24, 2018, 08:41:32 PM
Why don't we compare against rcpps, divps and divpd? That should give more useful results, I think.

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

3238    cycles for 1000 * rcpss
13094   cycles for 1000 * 1/x using fdiv
10453   cycles for 1000 * 1/x using divss

3157    cycles for 1000 * rcpss
13125   cycles for 1000 * 1/x using fdiv
10602   cycles for 1000 * 1/x using divss

3151    cycles for 1000 * rcpss
13064   cycles for 1000 * 1/x using fdiv
10644   cycles for 1000 * 1/x using divss

24      bytes for rcpss
23      bytes for 1/x using fdiv
39      bytes for 1/x using divss

ST0     123453440.0000000000
ST0     123456792.0000000000
ST0     123453440.0000000000

  movd xmm2, FP4(1.0)  ; keep 1.0 in xmm2
  align 4
  .Repeat
    movaps xmm1, xmm2  ; reload 1.0
    divss xmm1, xmm0   ; divide by 123456789; xmm0 is never overwritten, so iterations are independent
    dec ebx
  .Until Sign?
Title: Re: 1/x timings for FPU and SIMD code
Post by: Siekmanski on June 25, 2018, 12:28:25 AM
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)

3864    cycles for 1000 * rcpss
15808   cycles for 1000 * 1/x using fdiv
6211    cycles for 1000 * 1/x using divss

3865    cycles for 1000 * rcpss
15789   cycles for 1000 * 1/x using fdiv
6268    cycles for 1000 * 1/x using divss

3866    cycles for 1000 * rcpss
15797   cycles for 1000 * 1/x using fdiv
6190    cycles for 1000 * 1/x using divss

24      bytes for rcpss
23      bytes for 1/x using fdiv
39      bytes for 1/x using divss

ST0     123453440.0000000000
ST0     123456792.0000000000
ST0     123453440.0000000000
Title: Re: 1/x timings for FPU and SIMD code
Post by: mineiro on June 25, 2018, 02:07:24 AM
Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz (SSE4)

2584    cycles for 1000 * rcpss
16274   cycles for 1000 * 1/x using fdiv
1818    cycles for 1000 * 1/x using divss

2597    cycles for 1000 * rcpss
16282   cycles for 1000 * 1/x using fdiv
1818    cycles for 1000 * 1/x using divss

2587    cycles for 1000 * rcpss
16274   cycles for 1000 * 1/x using fdiv
1818    cycles for 1000 * 1/x using divss

24      bytes for rcpss
23      bytes for 1/x using fdiv
39      bytes for 1/x using divss

ST0     123453440.0000000000
ST0     123456792.0000000000
ST0     123453440.0000000000

--- ok ---
Title: Re: 1/x timings for FPU and SIMD code
Post by: zedd151 on June 25, 2018, 02:13:21 AM
1.60 GHz, as usual

AMD A6-9220e RADEON R4, 5 COMPUTE CORES 2C+3G   (SSE4)

2353    cycles for 1000 * rcpss
15288   cycles for 1000 * 1/x using fdiv
2070    cycles for 1000 * 1/x using divss

2352    cycles for 1000 * rcpss
15347   cycles for 1000 * 1/x using fdiv
2071    cycles for 1000 * 1/x using divss

2353    cycles for 1000 * rcpss
15283   cycles for 1000 * 1/x using fdiv
2069    cycles for 1000 * 1/x using divss

24      bytes for rcpss
23      bytes for 1/x using fdiv
39      bytes for 1/x using divss

ST0     123453440.0000000000
ST0     123456792.0000000000
ST0     123453440.0000000000
--- ok ---


Quote from: daydreamer on June 24, 2018, 08:41:32 PM
Why don't we...

Why don't you ... post your results? :P

The more, the merrier.   :biggrin:
Title: Re: 1/x timings for FPU and SIMD code
Post by: Yuri on June 25, 2018, 03:02:08 AM

Intel(R) Core(TM) i3 CPU         540  @ 3.07GHz (SSE4)

910     cycles for 1000 * rcpss
12488   cycles for 1000 * 1/x using fdiv
12446   cycles for 1000 * 1/x using divss

880     cycles for 1000 * rcpss
12217   cycles for 1000 * 1/x using fdiv
12854   cycles for 1000 * 1/x using divss

1029    cycles for 1000 * rcpss
12393   cycles for 1000 * 1/x using fdiv
12029   cycles for 1000 * 1/x using divss

24      bytes for rcpss
23      bytes for 1/x using fdiv
39      bytes for 1/x using divss

ST0     123453440.0000000000
ST0     123456792.0000000000
ST0     123453440.0000000000

--- ok ---
Title: Re: 1/x timings for FPU and SIMD code
Post by: zedd151 on June 25, 2018, 03:05:28 AM
These results vary widely.

fdiv is definitely out, but it's a toss-up between rcpss and divss.
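
To put numbers on that, the first timing triple from each report posted so far can be turned into cycles per iteration and a divss/rcpss ratio. A throwaway Python sketch (the dictionary only copies figures from this thread; the names are arbitrary):

```python
# First timing triple from each report above: cycles for 1000 iterations.
timings = {
    "i5-2450M": {"rcpss": 3238, "fdiv": 13094, "divss": 10453},
    "i7-4930K": {"rcpss": 3864, "fdiv": 15808, "divss": 6211},
    "i7-6700":  {"rcpss": 2584, "fdiv": 16274, "divss": 1818},
    "A6-9220e": {"rcpss": 2353, "fdiv": 15288, "divss": 2070},
    "i3 540":   {"rcpss": 910,  "fdiv": 12488, "divss": 12446},
}

def per_iteration(cycles, loops=1000):
    """Average cycles for one pass through the timed loop."""
    return cycles / loops

# Ratio above 1 means rcpss is faster, below 1 means divss is faster.
ratios = {cpu: t["divss"] / t["rcpss"] for cpu, t in timings.items()}

for cpu, t in timings.items():
    print(f"{cpu:9s} rcpss {per_iteration(t['rcpss']):6.2f}  "
          f"divss {per_iteration(t['divss']):6.2f}  "
          f"fdiv {per_iteration(t['fdiv']):6.2f}  "
          f"divss/rcpss {ratios[cpu]:6.2f}")
```

Only on the i7-6700 and the A6-9220e does the ratio drop below 1; on the i3 540 it is above 13.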

Title: Re: 1/x timings for FPU and SIMD code
Post by: jj2007 on June 25, 2018, 03:15:00 AM
Quote from: zedd151 on June 25, 2018, 03:05:28 AM
fdiv is definitely out

Not on Yuri's i3, where fdiv and divss take about the same time. Keep in mind that rcpss does 1/x only, while fdiv and divss are used for generic division. But I agree that divss is faster than fdiv on the other CPUs. Whether it matters is another question: do you have an innermost loop with a million iterations that needs a division and can live with low precision?
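
If precision is the only obstacle, a common remedy is one Newton-Raphson step, x1 = x0 * (2 - d * x0), which roughly doubles the number of correct bits of an estimate x0 of 1/d. The Python sketch below only imitates the idea: rough_reciprocal is a hypothetical stand-in that cuts the exact reciprocal to 12 mantissa bits (it is not a model of what rcpss really returns), and f32 rounds every intermediate result to single precision.

```python
import math
import struct

def f32(value):
    """Round a Python float to the nearest IEEE-754 single."""
    return struct.unpack("<f", struct.pack("<f", value))[0]

def rough_reciprocal(d, bits=12):
    """Hypothetical low-precision estimate: exact 1/d truncated to `bits` mantissa bits."""
    mantissa, exponent = math.frexp(1.0 / d)      # 1/d = mantissa * 2**exponent
    scale = 1 << bits
    return math.ldexp(math.floor(mantissa * scale) / scale, exponent)

def newton_step(d, x0):
    """One Newton-Raphson iteration for 1/d, rounding to single after each operation."""
    return f32(x0 * f32(2.0 - f32(d * x0)))

d = f32(123456789.0)              # the thread's test value as a single
x0 = f32(rough_reciprocal(d))     # low-precision starting estimate
x1 = newton_step(d, x0)           # refined estimate

exact = 1.0 / d
err0 = abs(x0 - exact) / exact
err1 = abs(x1 - exact) / exact
print("relative error before the step:", err0)
print("relative error after the step: ", err1)
```

In assembly the step adds a few multiply and add/subtract instructions after rcpss, so whether the combination still beats divss would have to be timed on each CPU.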
Title: Re: 1/x timings for FPU and SIMD code
Post by: LiaoMi on June 25, 2018, 10:56:44 PM
Intel(R) Core(TM) i7-4810MQ CPU @ 2.80GHz (SSE4)

2838 cycles for 1000 * rcpss
12369 cycles for 1000 * 1/x using fdiv
4554 cycles for 1000 * 1/x using divss

2852 cycles for 1000 * rcpss
12214 cycles for 1000 * 1/x using fdiv
4574 cycles for 1000 * 1/x using divss

2873 cycles for 1000 * rcpss
13037 cycles for 1000 * 1/x using fdiv
4570 cycles for 1000 * 1/x using divss

24 bytes for rcpss
23 bytes for 1/x using fdiv
39 bytes for 1/x using divss

ST0 123453440.0000000000
ST0 123456792.0000000000
ST0 123453440.0000000000

--- ok ---
Title: Re: 1/x timings for FPU and SIMD code
Post by: FORTRANS on June 26, 2018, 03:20:18 AM

Cut and paste from screen.
F:\TEMP\TEST>1_DIV_X
Intel(R) Pentium(R) M processor 1.70GHz (SSE2)

2088    cycles for 1000 * rcpss
13466   cycles for 1000 * fdiv

2085    cycles for 1000 * rcpss
13432   cycles for 1000 * fdiv

2087    cycles for 1000 * rcpss
13449   cycles for 1000 * fdiv

2050    cycles for 1000 * rcpss
13485   cycles for 1000 * fdiv

2083    cycles for 1000 * rcpss
13454   cycles for 1000 * fdiv

24      bytes for rcpss
23      bytes for fdiv

ST0     123453440.0000000000
ST0     123456792.0000000000

--- ok ---

Intel(R) Core(TM) i3-4005U CPU @ 1.70GHz (SSE4)

3732 cycles for 1000 * rcpss
15936 cycles for 1000 * fdiv

3736 cycles for 1000 * rcpss
15921 cycles for 1000 * fdiv

3738 cycles for 1000 * rcpss
15888 cycles for 1000 * fdiv

3762 cycles for 1000 * rcpss
15983 cycles for 1000 * fdiv

3762 cycles for 1000 * rcpss
15903 cycles for 1000 * fdiv

24 bytes for rcpss
23 bytes for fdiv

ST0 123453440.0000000000
ST0 123456792.0000000000

--- ok ---

Output redirected to file.
Intel(R) Pentium(R) M processor 1.70GHz (SSE2)

2246 cycles for 1000 * rcpss
13534 cycles for 1000 * 1/x using fdiv
16239 cycles for 1000 * 1/x using divss

2078 cycles for 1000 * rcpss
13686 cycles for 1000 * 1/x using fdiv
16026 cycles for 1000 * 1/x using divss

2482 cycles for 1000 * rcpss
13335 cycles for 1000 * 1/x using fdiv
16349 cycles for 1000 * 1/x using divss

24 bytes for rcpss
23 bytes for 1/x using fdiv
39 bytes for 1/x using divss

ST0 123453440.0000000000
ST0 123456792.0000000000
ST0 123453440.0000000000

--- ok ---

F:\TEMP\TEST>1_DIV_X
Intel(R) Core(TM) i3-4005U CPU @ 1.70GHz (SSE4)

3770    cycles for 1000 * rcpss
15923   cycles for 1000 * 1/x using fdiv
5992    cycles for 1000 * 1/x using divss

3768    cycles for 1000 * rcpss
15919   cycles for 1000 * 1/x using fdiv
5970    cycles for 1000 * 1/x using divss

3770    cycles for 1000 * rcpss
15932   cycles for 1000 * 1/x using fdiv
5970    cycles for 1000 * 1/x using divss

24      bytes for rcpss
23      bytes for 1/x using fdiv
39      bytes for 1/x using divss

ST0     123453440.0000000000
ST0     123456792.0000000000
ST0     123453440.0000000000

--- ok ---
Intel(R) Celeron(R) CPU N3350 @ 1.10GHz (SSE4)

1433    cycles for 1000 * rcpss
22803   cycles for 1000 * 1/x using fdiv
8498    cycles for 1000 * 1/x using divss

1411    cycles for 1000 * rcpss
21311   cycles for 1000 * 1/x using fdiv
8264    cycles for 1000 * 1/x using divss

1420    cycles for 1000 * rcpss
22523   cycles for 1000 * 1/x using fdiv
8388    cycles for 1000 * 1/x using divss

24      bytes for rcpss
23      bytes for 1/x using fdiv
39      bytes for 1/x using divss

ST0     123453440.0000000000
ST0     123456792.0000000000
ST0     123453440.0000000000

--- ok ---