Hi qWord!!
There is an issue in asin():
fslv_fnc_asin macro
fld st
fmulp st,st
fld1
fsubr fpu_const_r4_one
fsqrt
fpatan
endm
And it should be:
fslv_fnc_asin macro
fld st
fmul st,st
fsubr fpu_const_r4_one
fsqrt
fpatan
endm
Regards. HSE
I believe qWord's procedure is basically correct. The intent is to compute the equivalent cosine according to the relation
sin² + cos² = 1
which translates into
cos = sqrt(1 - sin²)
And then you get the angle from the arctan of the sin/cos ratio.
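With x = sin θ as the macro's input, the relation above can be written out as follows. FPATAN computes the arctangent of st(1)/st(0) with the correct quadrant, so x belongs in st(1) and the square root in st(0):

```latex
\begin{aligned}
\theta &= \arcsin x, \qquad -\tfrac{\pi}{2} \le \theta \le \tfrac{\pi}{2} \\
\sin\theta &= x, \qquad \cos\theta = \sqrt{1 - \sin^2\theta} = \sqrt{1 - x^2} \;\ge 0 \\
\theta &= \operatorname{atan2}(\sin\theta,\ \cos\theta) = \operatorname{atan2}\!\left(x,\ \sqrt{1 - x^2}\right)
\end{aligned}
```

The cosine is non-negative over the whole range of arcsin, which is why the positive square root is the right choice.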
The only concern I have is the use of "fpu_const_r4_one": I don't know what it is, and it probably should not even be there.
Hi Raymond!
qWord's entire macro system is in SmplMath (https://sourceforge.net/projects/smplmath/)
For sure, something else is sitting in the place of the arcsin function. I have not studied the code very much; instead I tested the results against other programs :biggrin:
LATER
Perhaps the idea was:
fslv_fnc_asin macro
fld st
fmul st,st ; without p
fld1
fsubr ; without fpu_const_r4_one
fsqrt
fpatan
endm
Quote: Perhaps the idea was:
Could very well be that he forgot to comment it out after some other modification. Good point. qWord should be able to confirm that.
Yes, I confirm this is a bug caused by a modification I made in version 2.0.
The FLD1 should be omitted and FMULP becomes FMUL. The subtraction 1-x² is then done using the constant fpu_const_r4_one (value=1.0) as the argument for FSUBR.
The idea was to keep the FPU-stack usage as small as possible: the version with FLD1 needs one more free FPU register.
I will fix that when time permits. Until then you can correct the macro yourself as described (using the FLD1 version would be wrong because of the additional FPU-register usage).
regards
qWord
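A stack trace may make the fix easier to follow. The sketch below is a standalone copy of the corrected sequence for illustration, not the SmplMath source: it assumes the argument is in st(0) on entry and that fpu_const_r4_one is a REAL4 constant holding 1.0 (the value is stated above, the REAL4 type is only inferred from the name). The names xval and result are hypothetical. The commented-out lines trace the version 2.0 sequence as I read it.

```asm
include \masm32\include\masm32rt.inc

fslv_fnc_asin macro                 ; corrected sequence, expects x in st(0)
    fld   st                        ; st0 = x             st1 = x
    fmul  st,st                     ; st0 = x*x           st1 = x
    fsubr fpu_const_r4_one          ; st0 = 1 - x*x       st1 = x
    fsqrt                           ; st0 = sqrt(1-x*x)   st1 = x
    fpatan                          ; st0 = atan2(x, sqrt(1-x*x)) = asin(x)
endm

; Version 2.0 sequence, for comparison only:
;   fld   st                        ; st0 = x             st1 = x
;   fmulp st,st                     ; x*x is written to st0 and popped: st0 = x
;   fld1                            ; st0 = 1.0           st1 = x
;   fsubr fpu_const_r4_one          ; st0 = 1.0 - 1.0 = 0.0
;   fsqrt                           ; st0 = 0.0
;   fpatan                          ; st0 = atan2(x, 0), i.e. +/-pi/2 for nonzero x

.data
    fpu_const_r4_one REAL4 1.0      ; assumed REAL4, as the name suggests
    xval   REAL8 0.5                ; hypothetical test argument
    result REAL8 0.0
.code
start:
    finit
    fld   xval
    fslv_fnc_asin                   ; one extra register used, one result left
    fstp  result
    printf("asin(0.5) = %f\n", result)
    inkey
    exit
end start
```

The corrected sequence needs only one free FPU register beyond the argument, which matches the stated goal of keeping stack usage small.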
I was thinking that perhaps with fld1 is fast. But most of the time the opposite is true.; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
include \masm32\include\masm32rt.inc
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
comment * -----------------------------------------------------
Build this console app with
"MAKEIT.BAT" on the PROJECT menu.
----------------------------------------------------- *
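.data?
    value dd ?                      ; unused
.data
    veces dd 10000000               ; iterations per timed loop
    item dd 0                       ; tick count at the start of a timed loop
    x1 dq 0.1596                    ; test argument
    fp64 dq 0.0                     ; receives each result
    fpu_const_r4_one dq 1.0         ; local stand-in for the SmplMath constant (REAL8 here)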
.code
start:
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
    call main
    inkey
    exit
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
main proc
    cls
    finit
    mov ecx, 5                      ; repeat the comparison five times
mayor:
    push ecx                        ; save the outer counter
    mov item, rv(GetTickCount)
    mov ecx, veces
@empieza:                           ; variant 1: FLD1 plus FSUBR without operand
    fld x1
    fld st
    fmul st,st
    fld1
    fsubr                           ; no operand, fpu_const_r4_one is not used
    fsqrt
    fpatan
    fstp fp64
    loop @empieza
    sub rv(GetTickCount), item      ; eax = elapsed milliseconds
    printf("%d\t is a value\n", eax);
    mov item, rv(GetTickCount)
    mov ecx, veces
@empieza2:                          ; variant 2: FSUBR with the memory constant
    fld x1
    fld st
    fmul st,st
    fsubr fpu_const_r4_one
    fsqrt
    fpatan
    fstp fp64
    loop @empieza2
    sub rv(GetTickCount), item      ; eax = elapsed milliseconds
    printf("%d\t is a value\n", eax);
    printf("\t \n");
    pop ecx                         ; restore the outer counter
    dec ecx
    .if ecx > 0
        jmp mayor
    .endif
    ret
main endp
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
end start
1029 is a value
1030 is a value
1045 is a value
1030 is a value
998 is a value
983 is a value
998 is a value
983 is a value
983 is a value
998 is a value
Press any key to continue ....
Quote from: HSE on February 26, 2017, 10:24:20 AM
I was thinking that perhaps with fld1 is fast.
fld1 is fast (approx. 1 cycle), especially when followed by ultra-slow fsqrt or fpatan:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
143 cycles for 100 * fld1
62 cycles for 100 * fld one real
62 cycles for 100 * fld one int
63 cycles for 100 * fldpi
1818 cycles for 100 * fsqrt
28991 cycles for 100 * fpatan
144 cycles for 100 * fld1
172 cycles for 100 * fld one real
172 cycles for 100 * fld one int
143 cycles for 100 * fldpi
1817 cycles for 100 * fsqrt
29007 cycles for 100 * fpatan
143 cycles for 100 * fld1
62 cycles for 100 * fld one real
62 cycles for 100 * fld one int
63 cycles for 100 * fldpi
1820 cycles for 100 * fsqrt
28994 cycles for 100 * fpatan
The real surprise here is that fld1 is a little bit slower than fldpi (same for FLDL2E etc).
Perfect JJ! :t
AMD A6-3500 APU with Radeon(tm) HD Graphics (SSE3)
?? cycles for 1000 * fld1
232 cycles for 1000 * fld one real
140 cycles for 1000 * fld one int
?? cycles for 1000 * fldpi
29720 cycles for 1000 * fsqrt
44963 cycles for 1000 * fpatan
212128 cycles for 1000 * WithFld1
212253 cycles for 1000 * WithoutFld1
?? cycles for 1000 * fld1
11 cycles for 1000 * fld one real
141 cycles for 1000 * fld one int
4 cycles for 1000 * fldpi
29691 cycles for 1000 * fsqrt
45041 cycles for 1000 * fpatan
212272 cycles for 1000 * WithFld1
212404 cycles for 1000 * WithoutFld1
0 cycles for 1000 * fld1
38 cycles for 1000 * fld one real
143 cycles for 1000 * fld one int
?? cycles for 1000 * fldpi
29729 cycles for 1000 * fsqrt
44907 cycles for 1000 * fpatan
212263 cycles for 1000 * WithFld1
212198 cycles for 1000 * WithoutFld1
4 bytes for fld1
8 bytes for fld one real
8 bytes for fld one int
4 bytes for fldpi
6 bytes for fsqrt
6 bytes for fpatan
24 bytes for WithFld1
26 bytes for WithoutFld1
--- ok ---
With these numbers, one free FPU register costs about 0.03% of the time. Cheap, I think :biggrin:.
Thanks. HSE
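The 0.03% figure appears to come from averaging the three WithFld1/WithoutFld1 pairs in the listing above. The calculation is not shown in the post, so the reconstruction below is an assumption, with each difference taken as WithoutFld1 minus WithFld1:

```latex
\begin{aligned}
\Delta_1 &= 212253 - 212128 = 125 \\
\Delta_2 &= 212404 - 212272 = 132 \\
\Delta_3 &= 212198 - 212263 = -65 \\
\bar{\Delta} &= \frac{125 + 132 - 65}{3} = 64 \ \text{cycles per 1000 iterations} \\
\frac{\bar{\Delta}}{212253} &\approx 0.0003 = 0.03\%
\end{aligned}
```

The third run has the opposite sign, so the difference between the two variants is within the run-to-run noise.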
Hi,
Some results from the oldie, moldy CPU collection. Somewhat
weird results?
P-III
pre-P4 (SSE1)
2 cycles for 100 * fld1
103 cycles for 100 * fld one real
193 cycles for 100 * fld one int
42 cycles for 100 * fldpi
6861 cycles for 100 * fsqrt
10301 cycles for 100 * fpatan
2 cycles for 100 * fld1
103 cycles for 100 * fld one real
192 cycles for 100 * fld one int
42 cycles for 100 * fldpi
6849 cycles for 100 * fsqrt
10300 cycles for 100 * fpatan
1 cycles for 100 * fld1
103 cycles for 100 * fld one real
192 cycles for 100 * fld one int
41 cycles for 100 * fldpi
6849 cycles for 100 * fsqrt
10297 cycles for 100 * fpatan
4 bytes for fld1
8 bytes for fld one real
8 bytes for fld one int
4 bytes for fldpi
6 bytes for fsqrt
6 bytes for fpatan
--- ok ---
P-MMX
pre-P4
303 cycles for 100 * fld1
201 cycles for 100 * fld one real
403 cycles for 100 * fld one int
911 cycles for 100 * fldpi
8012 cycles for 100 * fsqrt
7287 cycles for 100 * fpatan
303 cycles for 100 * fld1
202 cycles for 100 * fld one real
405 cycles for 100 * fld one int
914 cycles for 100 * fldpi
8044 cycles for 100 * fsqrt
7283 cycles for 100 * fpatan
305 cycles for 100 * fld1
200 cycles for 100 * fld one real
402 cycles for 100 * fld one int
915 cycles for 100 * fldpi
7994 cycles for 100 * fsqrt
7285 cycles for 100 * fpatan
4 bytes for fld1
8 bytes for fld one real
8 bytes for fld one int
4 bytes for fldpi
6 bytes for fsqrt
6 bytes for fpatan
--- ok ---
Intel(R) Pentium(R) M processor 1.70GHz (SSE2)
206 cycles for 100 * fld1
99 cycles for 100 * fld one real
280 cycles for 100 * fld one int
242 cycles for 100 * fldpi
7307 cycles for 100 * fsqrt
14883 cycles for 100 * fpatan
204 cycles for 100 * fld1
101 cycles for 100 * fld one real
271 cycles for 100 * fld one int
237 cycles for 100 * fldpi
6864 cycles for 100 * fsqrt
14878 cycles for 100 * fpatan
205 cycles for 100 * fld1
98 cycles for 100 * fld one real
274 cycles for 100 * fld one int
247 cycles for 100 * fldpi
6860 cycles for 100 * fsqrt
14882 cycles for 100 * fpatan
4 bytes for fld1
8 bytes for fld one real
8 bytes for fld one int
4 bytes for fldpi
6 bytes for fsqrt
6 bytes for fpatan
--- ok ---
HTH,
Steve N.
:biggrin:
Yeah, the results look a bit weird. The cycles are computed as the time for a full loop minus the time for an empty loop. That doesn't work exactly the same way on all processors. On the positive side, the timings are usually quite stable :bgrin:
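A minimal sketch of the "full loop minus empty loop" idea, assuming RDTSC as the cycle source. This is not the actual timing code behind the listings above; REPS and emptyCycles are hypothetical names, and no serializing instruction is used, so out-of-order execution can distort short measurements.

```asm
include \masm32\include\masm32rt.inc
.586                                ; rdtsc is not available at the default .486 level

REPS equ 100                        ; repetitions per timed loop

.data?
    emptyCycles dd ?                ; cost of the loop overhead alone
.code
start:
    finit

    ; 1) time the empty loop
    rdtsc                           ; edx:eax = time-stamp counter
    mov   esi, eax                  ; low 32 bits are enough for short runs
    mov   ecx, REPS
@@: dec   ecx
    jnz   @B
    rdtsc
    sub   eax, esi
    mov   emptyCycles, eax

    ; 2) time the same loop with the instruction under test
    rdtsc
    mov   esi, eax
    mov   ecx, REPS
@@: fld1                            ; instruction under test
    fstp  st(0)                     ; pop again so the FPU stack stays balanced (also timed)
    dec   ecx
    jnz   @B
    rdtsc
    sub   eax, esi

    ; 3) report full loop minus empty loop
    sub   eax, emptyCycles
    printf("%d cycles for %d * fld1\n", eax, REPS)
    inkey
    exit
end start
```

When the tested instruction overlaps with the loop overhead, the subtraction can give a result near zero or even negative. That is one plausible reading of the "??" and 0-cycle entries in the listings, though the thread does not confirm it.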