Hi
Is there a way to shuffle this data, or should a SIMD mask be used?
First I am going to shuffle x into all four floats (symbolized by 3.0); after that I need to mix 1.0's and x's.
; pattern for mulps, to achieve x^3,x^5,x^7,x^9 (x = 3.0)
x3x5x7x9 real4 3.0,3.0,3.0,3.0 ; x*x*x = mulps, mulps = x^3
real4 1.0,3.0,3.0,3.0
real4 1.0,1.0,3.0,3.0
real4 1.0,1.0,1.0,3.0
Hi Magnus,
It's not clear what needs to be shuffled in what order.
Do you have an example from start positions to end positions?
And which need to be multiplied by 3 5 7 or 9?
Quote from: Siekmanski on July 09, 2018, 07:05:35 AM
Hi Magnus,
It's not clear what needs to be shuffled in what order.
Do you have an example from start positions to end positions?
And which need to be multiplied by 3 5 7 or 9?
Sorry, I shortened the comment too much. It should be x^3, x^5, x^7, x^9. The first step is to shuffle one x into 4 x's, then mulps until I have x^3,x^3,x^3,x^3 in one xmm register, then change the multiplier to 1.0,x,x,x and keep multiplying.
What is important is to get the result laid out as in the data section posted above.
I thought that movups from a .data section holding 1.0,1.0,1.0,1.0,x,x,x,x might be a slow alternative?
Or an SSE2 shift?
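To make the steps above concrete, here is a minimal sketch in C with SSE intrinsics. The function name odd_powers is made up for illustration, and the choice of movss, movlhps and shufps to build the 1.0/x rows is only one possible way to do it, not something settled in this thread. A plain scalar fallback is included for compilers without SSE.

```c
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define HAVE_SSE 1
#else
#define HAVE_SSE 0
#endif

/* Hypothetical helper: writes x^3, x^5, x^7, x^9 to out[0..3]. */
void odd_powers(float x, float out[4])
{
#if HAVE_SSE
    __m128 xs   = _mm_set1_ps(x);       /* x, x, x, x */
    __m128 ones = _mm_set1_ps(1.0f);    /* 1, 1, 1, 1 */

    /* Build the multiplier rows from the data section. */
    __m128 m1 = _mm_move_ss(xs, ones);                             /* 1, x, x, x */
    __m128 m2 = _mm_movelh_ps(ones, xs);                           /* 1, 1, x, x */
    __m128 m3 = _mm_shuffle_ps(ones, m1, _MM_SHUFFLE(3, 0, 0, 0)); /* 1, 1, 1, x */

    __m128 r = xs;
    r = _mm_mul_ps(r, xs); r = _mm_mul_ps(r, xs);  /* x^3 in all lanes      */
    r = _mm_mul_ps(r, m1); r = _mm_mul_ps(r, m1);  /* x^3, x^5, x^5, x^5    */
    r = _mm_mul_ps(r, m2); r = _mm_mul_ps(r, m2);  /* x^3, x^5, x^7, x^7    */
    r = _mm_mul_ps(r, m3); r = _mm_mul_ps(r, m3);  /* x^3, x^5, x^7, x^9    */
    _mm_storeu_ps(out, r);
#else
    /* Scalar fallback: same rows, each applied twice per lane. */
    const float rows[4][4] = {
        { x,    x,    x,    x },
        { 1.0f, x,    x,    x },
        { 1.0f, 1.0f, x,    x },
        { 1.0f, 1.0f, 1.0f, x }
    };
    int row, lane;
    for (lane = 0; lane < 4; ++lane)
        out[lane] = x;
    for (row = 0; row < 4; ++row)
        for (lane = 0; lane < 4; ++lane)
            out[lane] = out[lane] * rows[row][lane] * rows[row][lane];
#endif
}
```

Calling odd_powers(3.0f, out) should leave 27, 243, 2187 and 19683 in out. Squaring x once and building the rows from x^2 would halve the number of multiplies, at the cost of one extra mulps up front.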
Something like this?
.const
Multipliers real4 8.0, 8.0, 8.0, 8.0 ; ^3
.data
YourData ???????
.code
movaps xmm0,oword ptr Multipliers
movaps xmm1,oword ptr YourData
pshufd xmm2,xmm1,???? ; shuffle your data into place
mulps xmm2,xmm0 ; result
pslld xmm0,2 ; ^5 ( update multipliers from ^3 to ^5 ); caution: pslld shifts the raw bits, so this scales integer multipliers only, not real4 values
; repeat steps with next set of data
Many years ago I saw someone make a fast integer sqrt; maybe the opposite should be possible?
What about a 32-bit shift combined with an OR of 1.0,0,0,0?
I almost never use shuffles.
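The shift-plus-OR idea can be checked without any SIMD at all. The sketch below emulates an xmm register as four 32-bit lanes in plain C; the name shift_in_one is made up, and it assumes float is a 32-bit IEEE value. Moving every lane up by one is what a whole-register byte shift by 4 (pslldq) does, and the OR with the bit pattern of 1.0 is what orps with a 1.0,0,0,0 constant would do.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: emulates "shift register left by 4 bytes, then OR
   with the constant 1.0,0,0,0" on a 4-lane vector stored in v[0..3]. */
void shift_in_one(float v[4])
{
    const float one = 1.0f;
    uint32_t bits[4];
    uint32_t one_bits;

    memcpy(bits, v, sizeof bits);              /* view the floats as raw bits */
    memcpy(&one_bits, &one, sizeof one_bits);  /* bit pattern of 1.0f */

    bits[3] = bits[2];   /* lanes move up by one position ... */
    bits[2] = bits[1];
    bits[1] = bits[0];
    bits[0] = 0;         /* ... and the byte shift fills lane 0 with zeros */

    bits[0] |= one_bits; /* OR in 1.0 in the low lane */

    memcpy(v, bits, sizeof bits);
}
```

Starting from x,x,x,x, one call gives 1.0,x,x,x, a second gives 1.0,1.0,x,x and a third gives 1.0,1.0,1.0,x, which are the rows of the data section above.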
This is what I have come up with so far. I am making a sine Taylor series.
Remember that x is in radians.
.code
start:
lea ebx,fconstant
add ebx,16
lea edx,x3x5x7x9
movaps xmm0,x
movaps xmm7,[edx]
movaps xmm6,[ebx]
mulps xmm0,xmm7 ; x^2 in all 4 lanes
mulps xmm0,xmm7 ; x^3 in all 4 lanes
add edx,16
mulps xmm0,[edx] ; x^4 in 3 lanes
mulps xmm0,[edx] ; x^5 in 3 lanes
add edx,16
mulps xmm0,[edx] ; x^6 in 2 lanes
mulps xmm0,[edx] ; x^7 in 2 lanes
add edx,16
mulps xmm0,[edx] ; x^8 in 1 lane
mulps xmm0,[edx] ; x^9 in 1 lane
mulps xmm0,xmm6 ; multiply by the reciprocals of 3!,5!,7!,9! with the right - or + signs, to prepare for haddps
;haddps here
;haddps
movss sinex,xmm0
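For checking the numbers, here is a scalar C sketch of what the listing above is working toward. The name taylor_sin9 is made up, and it assumes the constants are the signed reciprocals -1/3!, +1/5!, -1/7!, +1/9! as described in the comment. Note that the x term of the series is not one of the four lanes, so it has to be added to the horizontal sum at the end.

```c
/* Hypothetical helper: 9th-degree Taylor sine for x in radians,
   computed lane by lane the way the mulps/haddps sequence would. */
float taylor_sin9(float x)
{
    /* signed reciprocals of 3!, 5!, 7!, 9! */
    static const float coeff[4] = {
        -1.0f / 6.0f, 1.0f / 120.0f, -1.0f / 5040.0f, 1.0f / 362880.0f
    };
    float lanes[4];
    float x2 = x * x;
    float p = x2 * x;   /* x^3, the power held in lane 0 */
    int i;

    for (i = 0; i < 4; ++i) {
        lanes[i] = p * coeff[i];  /* the final mulps by the constants */
        p *= x2;                  /* next odd power: x^5, x^7, x^9 */
    }

    /* two haddps steps: pairwise sums first, then the sum of the pairs */
    return x + ((lanes[0] + lanes[1]) + (lanes[2] + lanes[3]));
}
```

The truncated series is only good for small angles, so the argument should be reduced to a small range around zero first.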
We did some trig testing routines a while back on the forum.
The Chebyshev Remez approximation with a 9th degree polynomial came out as the most accurate (this depends on the number of coefficients, of course).
4 optimized constants give a maximum error of about 3.3381e-9 over -pi/2 to +pi/2.
double fastsin2(double x)
{
const double a3 = -1.666665709650470145824129400050267289858e-1;
const double a5 = 8.333017291562218127986291618761571373087e-3;
const double a7 = -1.980661520135080504411629636078917643846e-4;
const double a9 = 2.600054767890361277123254766503271638682e-6;
return x + x*x*x * (a3 + x*x * (a5 + x*x * (a7 + x*x * a9)));
}
Here is a routine to calculate 4 real4 sines at once (it could be rewritten to do 2 sines and 2 cosines at once),
and a routine to calculate 2 real8 sines at once:
http://masm32.com/board/index.php?topic=4118.msg49276#msg49276
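As a rough idea of the 4-at-once version, here is a plain C sketch that runs the same polynomial over four real4 values. This is not the linked routine; the name fastsin4 is made up, and the coefficients are the fastsin2 ones rounded to single precision, so the accuracy is limited by real4 rounding rather than by the polynomial.

```c
/* Hypothetical helper: four single-precision sines at once, using the
   fastsin2 coefficients. Inputs are assumed to be within -pi/2..+pi/2. */
void fastsin4(const float x[4], float out[4])
{
    const float a3 = -1.666665710e-1f;
    const float a5 =  8.333017292e-3f;
    const float a7 = -1.980661520e-4f;
    const float a9 =  2.600054768e-6f;
    int i;

    /* Each pass of the loop is one lane; with SSE the four lanes
       would be processed by the same mulps/addps instructions. */
    for (i = 0; i < 4; ++i) {
        float x2 = x[i] * x[i];
        out[i] = x[i] + x[i] * x2 * (a3 + x2 * (a5 + x2 * (a7 + x2 * a9)));
    }
}
```

Because every lane does exactly the same arithmetic, the loop body maps one to one onto packed instructions with the four coefficients broadcast into constants.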
Thanks for the link, Marinus.
Check the asmc large integers and floats thread, about real16's:
http://masm32.com/board/index.php?topic=6454.15
Wouldn't it be a good candidate for pi calculation, using 6*arcsin(0.5)?
With fixed point it would be a simple shift instruction to get powers of 0.5, 0.25, etc.
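A quick fixed-point sketch of that idea in C, using the series arcsin(x) = x + (1/2)x^3/3 + (1*3)/(2*4)x^5/5 + ... with x = 0.5. The name pi_from_arcsin_fixed and the choice of 60 fraction bits in a 64-bit integer are my own assumptions; a real16 or big-number version would follow the same pattern with more bits.

```c
#include <stdint.h>

/* Hypothetical helper: pi = 6*arcsin(0.5), summed in fixed point with
   60 fraction bits. t holds c_n * 0.5^(2n+1), where c_n is the series
   coefficient (1*3*...*(2n-1)) / (2*4*...*(2n)). */
double pi_from_arcsin_fixed(void)
{
    const int frac_bits = 60;
    uint64_t t = (uint64_t)1 << (frac_bits - 1);  /* 0.5 */
    uint64_t sum = t;                             /* n = 0 term: 0.5 / 1 */
    uint64_t n;

    for (n = 1; t != 0; ++n) {
        t >>= 2;                        /* times 0.25: just a shift */
        t = t * (2 * n - 1) / (2 * n);  /* next coefficient c_n */
        sum += t / (2 * n + 1);         /* add the term c_n * x^(2n+1) / (2n+1) */
    }

    /* sum is arcsin(0.5) = pi/6 scaled by 2^60 */
    return (double)(6 * sum) / (double)((uint64_t)1 << frac_bits);
}
```

The powers of 0.5 really do come from a shift, but each term still needs a multiply and two divides by small integers, so the shift only removes part of the work. The loop stops after about 30 terms, once t has been shifted down to zero.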