fsqrt vs rsqrtss

Started by jj2007, December 30, 2017, 09:25:57 PM


Mikl__

Hi, daydreamer2!
And to you, "Happy New Year!" During the New Year holidays I'll think about applying SSE2 to calculating the square root.

daydreamer

Quote from: Mikl__ on December 31, 2017, 09:45:58 PM
Hi, daydreamer2!
And to you, "Happy New Year!" During the New Year holidays I'll think about applying SSE2 to calculating the square root.

Happy New Year, Miki :bgrin:
My non-asm creations
https://masm32.com/board/index.php?topic=6937.msg74303#msg74303
I am an Invoker
"An Invoker is a mage who specializes in the manipulation of raw and elemental energies."
Like SIMD coding

daydreamer

Hi,
I want these timings as research for optimizing a prime test, starting with my CPU.
Zedd also reminded me that some members have gotten newer computers since this test was run.
I'm also interested in div, fdiv, divss, divps, divsd, divpd timings,
and in mul, fmul, mulss, mulps, mulsd, mulpd timings.

Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz (SSE4)

5432    cycles for 100 * fsqrt
4205    cycles for 100 * rsqrtss
4559    cycles for 100 * sqrtss
5435    cycles for 100 * sqrtps

5392    cycles for 100 * fsqrt
4204    cycles for 100 * rsqrtss
4553    cycles for 100 * sqrtss
5450    cycles for 100 * sqrtps

5385    cycles for 100 * fsqrt
4215    cycles for 100 * rsqrtss
4551    cycles for 100 * sqrtss
5435    cycles for 100 * sqrtps

5386    cycles for 100 * fsqrt
4198    cycles for 100 * rsqrtss
4557    cycles for 100 * sqrtss
5451    cycles for 100 * sqrtps

5407    cycles for 100 * fsqrt
4203    cycles for 100 * rsqrtss
4553    cycles for 100 * sqrtss
5449    cycles for 100 * sqrtps

87      bytes for fsqrt
97      bytes for rsqrtss
89      bytes for sqrtss
87      bytes for sqrtps

9483.79199 = fsqrt
9483.57227 = rsqrtss
9483.79199 = sqrtss
1.16594337e+09 = sqrtps


zedd151

Per Magnus's suggestion...

Dell Optiplex 7050 SFF PC:
Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz (SSE4)

5367    cycles for 100 * fsqrt
4180    cycles for 100 * rsqrtss
4563    cycles for 100 * sqrtss
5435    cycles for 100 * sqrtps

5364    cycles for 100 * fsqrt
4173    cycles for 100 * rsqrtss
4527    cycles for 100 * sqrtss
5384    cycles for 100 * sqrtps

5366    cycles for 100 * fsqrt
4178    cycles for 100 * rsqrtss
4543    cycles for 100 * sqrtss
5475    cycles for 100 * sqrtps

5464    cycles for 100 * fsqrt
4241    cycles for 100 * rsqrtss
4569    cycles for 100 * sqrtss
5499    cycles for 100 * sqrtps

5410    cycles for 100 * fsqrt
4200    cycles for 100 * rsqrtss
4561    cycles for 100 * sqrtss
5424    cycles for 100 * sqrtps

87      bytes for fsqrt
97      bytes for rsqrtss
89      bytes for sqrtss
87      bytes for sqrtps

9483.79199 = fsqrt
9483.57227 = rsqrtss
9483.79199 = sqrtss
1.16594337e+09 = sqrtps

OTVOC 15 laptop:
Intel(R) Celeron(R) N5105 @ 2.00GHz (SSE4)

4494    cycles for 100 * fsqrt
3734    cycles for 100 * rsqrtss
3788    cycles for 100 * sqrtss
4906    cycles for 100 * sqrtps

4481    cycles for 100 * fsqrt
3709    cycles for 100 * rsqrtss
3835    cycles for 100 * sqrtss
4708    cycles for 100 * sqrtps

4462    cycles for 100 * fsqrt
3575    cycles for 100 * rsqrtss
3904    cycles for 100 * sqrtss
4838    cycles for 100 * sqrtps

4536    cycles for 100 * fsqrt
3637    cycles for 100 * rsqrtss
3879    cycles for 100 * sqrtss
4673    cycles for 100 * sqrtps

4409    cycles for 100 * fsqrt
3513    cycles for 100 * rsqrtss
3748    cycles for 100 * sqrtss
4649    cycles for 100 * sqrtps

87      bytes for fsqrt
97      bytes for rsqrtss
89      bytes for sqrtss
87      bytes for sqrtps

9483.79199 = fsqrt
9483.57227 = rsqrtss
9483.79199 = sqrtss
1.16594337e+09 = sqrtps

:cool:   Not too shabby, eh?
¯\_(ツ)_/¯

daydreamer


daydreamer


zedd151

Quote from: daydreamer on January 11, 2025, 04:41:50 AM
fastest sqrt ever  :biggrin:

That is a bold statement, Magnus. That sounds like a challenge. :biggrin:
We'll see... I'll look at that later. :cool:


A few short minutes later...
"shape2.com"  :rolleyes:

Okay, let me get DOSBox...

Is that what it is supposed to look like? (It's the final screen.)

daydreamer

It's very fast: it draws 4 tiles, each rotated 90 degrees.
It was inspired by pixel shaders, which draw things with a formula per pixel.
After that I developed a straight road, a sine road and a 45-degree road, plus scrolling through all 640 KB.
It was the beginning of a demo that drew a racetrack or a river.

For limiting the divisions in a prime test, an integer sqrt proc also has the bonus of detecting some non-primes:
sqrt(4), sqrt(16), sqrt(25), sqrt(36), sqrt(100) ... a zero fraction means n is a perfect square, so it is not prime and the division loop can be skipped entirely.
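A minimal C sketch of that idea, assuming 32-bit unsigned inputs: trial division only runs up to floor(sqrt(n)), and when the integer root squared equals n the number is rejected before the loop starts. The names isqrt32 and is_prime are illustrative, and the bit-by-bit root used here is a generic technique, not daydreamer's proc.

```c
#include <stdbool.h>
#include <stdint.h>

/* Exact floor(sqrt(n)), computed bit by bit (digit-by-digit in base 2). */
static uint32_t isqrt32(uint32_t n)
{
    uint32_t root = 0;
    uint32_t bit = 1u << 30;          /* highest power of 4 that fits in 32 bits */

    while (bit > n)
        bit >>= 2;
    while (bit != 0) {
        if (n >= root + bit) {
            n -= root + bit;
            root = (root >> 1) + bit;
        } else {
            root >>= 1;
        }
        bit >>= 2;
    }
    return root;
}

/* Trial division bounded by isqrt(n); a perfect square exits before the loop. */
static bool is_prime(uint32_t n)
{
    if (n < 2)
        return false;
    if (n < 4)
        return true;                  /* 2 and 3 */
    if (n % 2 == 0)
        return false;

    uint32_t r = isqrt32(n);
    if (r * r == n)
        return false;                 /* zero fraction: perfect square, skip the loop */

    for (uint32_t d = 3; d <= r; d += 2)
        if (n % d == 0)
            return false;
    return true;
}
```

Note that even squares such as 4, 16, 36 and 100 are already rejected by the parity check, so in this arrangement the square test only saves work for odd squares like 9, 25 or 49.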

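Integer-only sqrt for the prime test loop (valid for eax < 65536, since only al and ah are used):
;hiByteLut: approximate sqrt, indexed by the high byte
;loByteLut: sqrt rounded to the nearest integer, indexed by the low byte

.if eax > 255                   ; 256 and up need the high-byte table
    mov al, ah                  ; index = high byte
    mov ebx, offset hiByteLut
    xlat                        ; al = [ebx + al]
.else
    mov ebx, offset loByteLut
    xlat                        ; al = [ebx + al]
.endif
movzx eax, al                   ; eax = approximate sqrt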

I bet this could be fast
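The two-table lookup can be sketched in C as below. This is an illustration under assumptions, not daydreamer's tables: inputs are limited to 0..65535 (as mov al,ah implies), the low table holds sqrt rounded to the nearest integer as described, and for the high table one safe choice is the largest integer root in each 256-value block, isqrt(h*256+255), so the estimate never undershoots and stays a valid upper bound for trial division. lo_lut, hi_lut, isqrt_slow, init_sqrt_luts and lut_sqrt16 are hypothetical names.

```c
#include <stdint.h>

static uint8_t lo_lut[256];   /* sqrt(x) rounded to nearest, x = 0..255 */
static uint8_t hi_lut[256];   /* largest root in the block h*256 .. h*256+255 */

/* Slow exact floor(sqrt(n)); only used to build the tables. */
static uint32_t isqrt_slow(uint32_t n)
{
    uint32_t r = 0;
    while ((r + 1) * (r + 1) <= n)
        r++;
    return r;
}

static void init_sqrt_luts(void)
{
    for (uint32_t i = 0; i < 256; i++) {
        uint32_t r = isqrt_slow(i);
        /* sqrt(i) >= r + 0.5 exactly when i > r*r + r */
        lo_lut[i] = (uint8_t)(r + (i > r * r + r));
        hi_lut[i] = (uint8_t)isqrt_slow(i * 256 + 255);
    }
}

/* Approximate sqrt for 0 <= n <= 65535, mirroring the .if/.else above. */
static uint32_t lut_sqrt16(uint32_t n)
{
    return (n > 255) ? hi_lut[n >> 8] : lo_lut[n];
}
```

With this choice the estimate is never below the exact root and at most 6 above it (worst case n = 256, which returns 22 instead of 16), so a few extra divisions are the price of skipping the real sqrt.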




InfiniteLoop

Which is better?
The 1st uses fewer bytes, the 2nd uses fewer cycles, the 3rd is 100% accurate.
That's enough effort spent on square roots for now.

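vcvtusi2ss xmm0,xmm0,rcx      ;unsigned long long to float: x
vrsqrt14ss xmm1,xmm0,xmm0     ;r ~ 1/sqrt(x), rel. error < 2^-14
vmulss xmm3,xmm1,xmm0         ;s = x*r ~ sqrt(x)
vfmsub231ss xmm0,xmm3,xmm3    ;xmm0 = s*s - x (error term)
vaddss xmm2,xmm3,xmm3         ;2*s
vrcpss xmm2,xmm2,xmm2         ;~ 1/(2*s), about 12-bit estimate
vfnmadd231ss xmm3,xmm0,xmm2   ;s = s - (s*s - x)/(2*s), one Newton step
vcvttss2usi rax,xmm3          ;truncate to unsigned long long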

vs

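vcvtusi2ss xmm0,xmm0,rcx              ;unsigned long long to float: x
vrsqrt14ss xmm1,xmm0,xmm0             ;r ~ 1/sqrt(x)
vmulss xmm3,xmm1,xmm0                 ;s = x*r ~ sqrt(x)
vmulss xmm2,xmm1,real4 ptr [F_HALF]   ;1/2 * rsqrt(x), stands in for 1/(2*s); F_HALF = real4 0.5 (not shown)
vfmsub231ss xmm0,xmm3,xmm3            ;xmm0 = s*s - x
vfnmadd231ss xmm3,xmm0,xmm2           ;s = s - (s*s - x)*r/2, one Newton step
vcvttss2usi rax,xmm3                  ;truncate to unsigned long long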

vs

vcvtusi2sd xmm0,xmm0,rcx  ;unsigned long long to double
vsqrtpd xmm1,xmm0         ;18/12 cycles. expensive function.
vcvttsd2usi rax,xmm1      ;truncate to unsigned long long

daydreamer

Thanks, InfiniteLoop.
What is best depends on the purpose:
the fastest one is good enough for restricting the divisions in a prime test loop,
and sqrtps is good for a 2D round light.

jj2007

Quote from: InfiniteLoop on February 01, 2025, 09:30:27 PM
vcvtusi2ss xmm0,xmm0,rcx    ;unsigned long long to float

Crashes with "illegal instruction" on my relatively new CPU.

jack

Apparently the unsigned conversion instructions vcvtusi2ss and vcvttsd2usi are AVX-512 instructions (so are vcvtusi2sd, vcvttss2usi and vrsqrt14ss, which the snippets also use).