1/x timings for FPU and SIMD code

Started by jj2007, June 23, 2018, 05:22:40 AM

jj2007

Quote from: zedd151 on June 25, 2018, 03:05:28 AM
fdiv is definitely out

Not for Yuri's i3. Note that rcpss computes 1/x only, while fdiv and divss perform generic division. But I agree that on other CPUs divss is faster. Whether it matters is another question: do you have an innermost loop with a million iterations that needs a division and can live with low precision?
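
To put a rough number on "low precision", here is a minimal Python sketch. It assumes that the two distinct ST0 values printed by the test program in this thread are the rcpss-based result (123453440) and the fdiv-based result (123456792), in the same order as the timing rows; that mapping is an assumption, not something the output states. The bound of 1.5 * 2^-12 is the maximum relative error Intel documents for a single rcpss approximation. The names `relative_error`, `RCPSS_RESULT` and `FDIV_RESULT` are made up for this illustration.

```python
# Values copied from the ST0 lines of the test output posted in this thread.
# Assumption: the first ST0 line belongs to the rcpss variant and the second
# to the fdiv variant, matching the order of the timing rows.
RCPSS_RESULT = 123453440.0
FDIV_RESULT = 123456792.0

# Maximum relative error Intel documents for a single rcpss approximation.
RCPSS_MAX_REL_ERROR = 1.5 * 2.0 ** -12


def relative_error(approx: float, reference: float) -> float:
    """Return |approx - reference| / |reference|."""
    return abs(approx - reference) / abs(reference)


# Relative difference between the two printed results.
observed = relative_error(RCPSS_RESULT, FDIV_RESULT)

if __name__ == "__main__":
    print(f"observed relative difference: {observed:.3e}")
    print(f"documented rcpss bound:       {RCPSS_MAX_REL_ERROR:.3e}")
    print(f"within bound: {observed <= RCPSS_MAX_REL_ERROR}")
```

Under that assumption the two results differ by a few parts in 100,000, so they share only their first five digits. Whether that is acceptable depends entirely on what the inner loop computes.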

LiaoMi

Intel(R) Core(TM) i7-4810MQ CPU @ 2.80GHz (SSE4)

2838 cycles for 1000 * rcpss
12369 cycles for 1000 * 1/x using fdiv
4554 cycles for 1000 * 1/x using divss

2852 cycles for 1000 * rcpss
12214 cycles for 1000 * 1/x using fdiv
4574 cycles for 1000 * 1/x using divss

2873 cycles for 1000 * rcpss
13037 cycles for 1000 * 1/x using fdiv
4570 cycles for 1000 * 1/x using divss

24 bytes for rcpss
23 bytes for 1/x using fdiv
39 bytes for 1/x using divss

ST0 123453440.0000000000
ST0 123456792.0000000000
ST0 123453440.0000000000

--- ok ---

FORTRANS


Cut and paste from screen.
F:\TEMP\TEST>1_DIV_X
Intel(R) Pentium(R) M processor 1.70GHz (SSE2)

2088    cycles for 1000 * rcpss
13466   cycles for 1000 * fdiv

2085    cycles for 1000 * rcpss
13432   cycles for 1000 * fdiv

2087    cycles for 1000 * rcpss
13449   cycles for 1000 * fdiv

2050    cycles for 1000 * rcpss
13485   cycles for 1000 * fdiv

2083    cycles for 1000 * rcpss
13454   cycles for 1000 * fdiv

24      bytes for rcpss
23      bytes for fdiv

ST0     123453440.0000000000
ST0     123456792.0000000000

--- ok ---

Intel(R) Core(TM) i3-4005U CPU @ 1.70GHz (SSE4)

3732 cycles for 1000 * rcpss
15936 cycles for 1000 * fdiv

3736 cycles for 1000 * rcpss
15921 cycles for 1000 * fdiv

3738 cycles for 1000 * rcpss
15888 cycles for 1000 * fdiv

3762 cycles for 1000 * rcpss
15983 cycles for 1000 * fdiv

3762 cycles for 1000 * rcpss
15903 cycles for 1000 * fdiv

24 bytes for rcpss
23 bytes for fdiv

ST0 123453440.0000000000
ST0 123456792.0000000000

--- ok ---

Output redirected to file.
Intel(R) Pentium(R) M processor 1.70GHz (SSE2)

2246 cycles for 1000 * rcpss
13534 cycles for 1000 * 1/x using fdiv
16239 cycles for 1000 * 1/x using divss

2078 cycles for 1000 * rcpss
13686 cycles for 1000 * 1/x using fdiv
16026 cycles for 1000 * 1/x using divss

2482 cycles for 1000 * rcpss
13335 cycles for 1000 * 1/x using fdiv
16349 cycles for 1000 * 1/x using divss

24 bytes for rcpss
23 bytes for 1/x using fdiv
39 bytes for 1/x using divss

ST0 123453440.0000000000
ST0 123456792.0000000000
ST0 123453440.0000000000

--- ok ---

F:\TEMP\TEST>1_DIV_X
Intel(R) Core(TM) i3-4005U CPU @ 1.70GHz (SSE4)

3770    cycles for 1000 * rcpss
15923   cycles for 1000 * 1/x using fdiv
5992    cycles for 1000 * 1/x using divss

3768    cycles for 1000 * rcpss
15919   cycles for 1000 * 1/x using fdiv
5970    cycles for 1000 * 1/x using divss

3770    cycles for 1000 * rcpss
15932   cycles for 1000 * 1/x using fdiv
5970    cycles for 1000 * 1/x using divss

24      bytes for rcpss
23      bytes for 1/x using fdiv
39      bytes for 1/x using divss

ST0     123453440.0000000000
ST0     123456792.0000000000
ST0     123453440.0000000000

--- ok ---

Intel(R) Celeron(R) CPU N3350 @ 1.10GHz (SSE4)

1433    cycles for 1000 * rcpss
22803   cycles for 1000 * 1/x using fdiv
8498    cycles for 1000 * 1/x using divss

1411    cycles for 1000 * rcpss
21311   cycles for 1000 * 1/x using fdiv
8264    cycles for 1000 * 1/x using divss

1420    cycles for 1000 * rcpss
22523   cycles for 1000 * 1/x using fdiv
8388    cycles for 1000 * 1/x using divss

24      bytes for rcpss
23      bytes for 1/x using fdiv
39      bytes for 1/x using divss

ST0     123453440.0000000000
ST0     123456792.0000000000
ST0     123453440.0000000000

--- ok ---
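
For a side-by-side view of the three-way results, the following Python sketch takes the first run posted for each CPU and works out the cost per iteration and the ratios against rcpss. The numbers are copied from the posts above; the helper name `summarize` and the dictionary layout are made up for this illustration. The per-iteration figure includes whatever loop overhead the test program has, so treat it as an upper bound on the instruction cost, not as a latency measurement.

```python
# First run of the three-way benchmark for each CPU, copied from the posts
# above: cycles for 1000 iterations of (rcpss, 1/x via fdiv, 1/x via divss).
RUNS = {
    "Core i7-4810MQ": (2838, 12369, 4554),
    "Pentium M 1.70GHz": (2246, 13534, 16239),
    "Core i3-4005U": (3770, 15923, 5992),
    "Celeron N3350": (1433, 22803, 8498),
}
ITERATIONS = 1000  # each timing line reports "cycles for 1000 * ..."


def summarize(runs):
    """Return one dict per CPU with per-iteration cost and ratios vs rcpss."""
    rows = []
    for cpu, (rcpss, fdiv, divss) in runs.items():
        rows.append({
            "cpu": cpu,
            "rcpss_per_iter": rcpss / ITERATIONS,
            "fdiv_vs_rcpss": fdiv / rcpss,
            "divss_vs_rcpss": divss / rcpss,
            "divss_faster_than_fdiv": divss < fdiv,
        })
    return rows


if __name__ == "__main__":
    for row in summarize(RUNS):
        print(f"{row['cpu']:<18} "
              f"rcpss {row['rcpss_per_iter']:.2f} cycles/iter, "
              f"fdiv x{row['fdiv_vs_rcpss']:.1f}, "
              f"divss x{row['divss_vs_rcpss']:.1f}, "
              f"divss faster than fdiv: {row['divss_faster_than_fdiv']}")
```

On these first runs rcpss is the fastest variant on all four CPUs, and divss beats fdiv everywhere except on the Pentium M.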