asin() in SmplMath

Started by HSE, February 22, 2017, 11:07:37 AM


HSE

Hi qWord!!

There is an issue in asin():
fslv_fnc_asin macro
  fld st
  fmulp st,st
  fld1
  fsubr fpu_const_r4_one
  fsqrt
  fpatan
endm


It should be:
fslv_fnc_asin macro
  fld st
  fmul st,st
  fsubr fpu_const_r4_one
  fsqrt
  fpatan
endm


Regards. HSE
Equations in Assembly: SmplMath

raymond

I believe qWord's procedure is basically correct. The intent is to compute the equivalent cosine according to the relation
sin² + cos² = 1

which translates into
cos = sqrt(1 - sin²)

And then you get the angle from the arctan of the sin/cos ratio.
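
As a sketch of that relation in FPU terms, the small console skeleton below traces the register stack through one possible instruction sequence. The names sinval and angle and the input 0.5 are assumptions chosen for illustration, and this is not necessarily the sequence SmplMath uses.

```asm
    include \masm32\include\masm32rt.inc

    .data
sinval  REAL8 0.5       ; assumption: sample input, must satisfy |x| <= 1
angle   REAL8 0.0       ; receives the computed arcsine in radians

    .code
start:
    finit
    fld sinval          ; st0 = x
    fld st              ; st0 = x          st1 = x
    fmul st, st         ; st0 = x*x        st1 = x
    fld1                ; st0 = 1          st1 = x*x    st2 = x
    fsubr               ; no-operand form pops: st0 = 1 - x*x   st1 = x
    fsqrt               ; st0 = sqrt(1 - x*x) = cos     st1 = x = sin
    fpatan              ; st0 = atan(st1 / st0) = atan(sin/cos), one register popped
    fstp angle          ; store the angle and leave the FPU stack empty
    exit
end start
```

With the input 0.5, angle should end up close to pi/6.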

My only concern is the use of "fpu_const_r4_one": I don't know what it is, and it probably should not even be there.
Whenever you assume something, you risk being wrong half the time.
https://masm32.com/masmcode/rayfil/index.html

HSE

Hi Raymond!

qWord's entire macro system is in SmplMath.

For sure, something else is sitting in the arcsin function's place. I have not studied the code very much; instead I tested the results against other programs :biggrin:

LATER

Perhaps the idea was:
fslv_fnc_asin macro
  fld st
  fmul st,st ; without p
  fld1
  fsubr    ; without fpu_const_r4_one
  fsqrt
  fpatan
endm

raymond

Quote: Perhaps the idea was:

It could very well be that he forgot to comment it out after some other modification. Good point. qWord should be able to confirm that.

qWord

Yes, I confirm this is a bug caused by a modification I made in version 2.0.

The FLD1 should be omitted and FMULP becomes FMUL. The subtraction 1-x² is then done using the constant fpu_const_r4_one (value=1.0) as the argument for FSUBR.
The idea was to keep the FPU stack usage as small as possible: the version with FLD1 needs one more free FPU register.

I will fix that if time permits. Until then you can correct the macro yourself as described (using the FLD1 version would be wrong because of the additional FPU register usage).
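
The register usage described above can be followed step by step in a short sketch. The names one_r4, sinval and angle are assumptions; one_r4 merely stands in for SmplMath's fpu_const_r4_one, whose actual declaration is not shown in this thread.

```asm
; Going by the instruction semantics, the version 2.0 sequence appears to fail like this:
;   fld st                    ; st0 = x   st1 = x
;   fmulp st,st               ; squares st0, then pops it: st0 = x again
;   fld1                      ; st0 = 1   st1 = x
;   fsubr fpu_const_r4_one    ; st0 = 1 - 1 = 0
;   fsqrt                     ; st0 = 0
;   fpatan                    ; atan(x / 0): pi/2 for any positive x

    include \masm32\include\masm32rt.inc

    .data
one_r4  REAL4 1.0       ; assumption: stand-in for fpu_const_r4_one (value 1.0)
sinval  REAL8 0.5       ; assumption: sample input
angle   REAL8 0.0       ; receives the result

    .code
start:
    finit
    fld sinval          ; depth 1: st0 = x
    fld st              ; depth 2: st0 = x          st1 = x
    fmul st, st         ; depth 2: st0 = x*x        st1 = x   (FMUL does not pop)
    fsubr one_r4        ; depth 2: st0 = 1 - x*x    st1 = x   (memory operand, no FLD1)
    fsqrt               ; depth 2: st0 = sqrt(1 - x*x)
    fpatan              ; depth 1: st0 = atan(st1 / st0) = asin(x)
    fstp angle          ; depth 0
    exit
end start
```

The corrected sequence never holds more than two registers, whereas a variant that loads 1.0 with FLD1 briefly holds three (x, x*x and 1).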

regards
qWord
MREAL macros - when you need floating point arithmetic while assembling!

HSE

I was thinking that perhaps with fld1 is fast. But most of the time the opposite is true.; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
    include \masm32\include\masm32rt.inc
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

comment * -----------------------------------------------------
                     Build this console app with
                  "MAKEIT.BAT" on the PROJECT menu.
        ----------------------------------------------------- *

    .data?
value dd ?

    .data
veces dd 10000000
item dd 0

x1 dq 0.1596
fp64 dq 0.0

fpu_const_r4_one dq 1.0
   
    .code

start:
   
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    call main
    inkey
    exit

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

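main proc
    cls

    finit
    mov ecx, 5                      ; repeat the whole comparison 5 times
mayor:
    push ecx                        ; save the outer counter, ecx is reused below

    ; ---- version with fld1 ----
    mov item, rv(GetTickCount)      ; start time in milliseconds
    mov ecx, veces                  ; veces = number of iterations
@empieza:
    fld x1
    fld st
    fmul st,st
    fld1
    fsubr                           ; fpu_const_r4_one left out in this version
    fsqrt
    fpatan
    fstp fp64
    loop @empieza

    sub rv(GetTickCount), item      ; eax = elapsed milliseconds
    printf("%d\t is a value\n", eax);

    ; ---- version with the memory constant ----
    mov item, rv(GetTickCount)
    mov ecx, veces
@empieza2:
    fld x1
    fld st
    fmul st,st
    fsubr fpu_const_r4_one
    fsqrt
    fpatan
    fstp fp64
    loop @empieza2

    sub rv(GetTickCount), item      ; eax = elapsed milliseconds
    printf("%d\t is a value\n", eax);
    printf("\t \n");

    pop ecx                         ; restore the outer counter
    dec ecx
    .if ecx > 0
        jmp mayor
    .endif
    ret

main endp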

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

end start


1029     is a value
1030     is a value

1045     is a value
1030     is a value

998      is a value
983      is a value

998      is a value
983      is a value

983      is a value
998      is a value

Press any key to continue ....


jj2007

Quote from: HSE on February 26, 2017, 10:24:20 AM
I was thinking that perhaps with fld1 is fast.

fld1 is fast (approx. 1 cycle), especially when followed by the ultra-slow fsqrt or fpatan:

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

143     cycles for 100 * fld1
62      cycles for 100 * fld one real
62      cycles for 100 * fld one int
63      cycles for 100 * fldpi
1818    cycles for 100 * fsqrt
28991   cycles for 100 * fpatan

144     cycles for 100 * fld1
172     cycles for 100 * fld one real
172     cycles for 100 * fld one int
143     cycles for 100 * fldpi
1817    cycles for 100 * fsqrt
29007   cycles for 100 * fpatan

143     cycles for 100 * fld1
62      cycles for 100 * fld one real
62      cycles for 100 * fld one int
63      cycles for 100 * fldpi
1820    cycles for 100 * fsqrt
28994   cycles for 100 * fpatan


The real surprise here is that fld1 is a little bit slower than fldpi (same for FLDL2E etc).

HSE

Perfect, JJ!  :t

AMD A6-3500 APU with Radeon(tm) HD Graphics (SSE3)

??      cycles for 1000 * fld1
232     cycles for 1000 * fld one real
140     cycles for 1000 * fld one int
??      cycles for 1000 * fldpi
29720   cycles for 1000 * fsqrt
44963   cycles for 1000 * fpatan
212128  cycles for 1000 * WithFld1
212253  cycles for 1000 * WithoutFld1

??      cycles for 1000 * fld1
11      cycles for 1000 * fld one real
141     cycles for 1000 * fld one int
4       cycles for 1000 * fldpi
29691   cycles for 1000 * fsqrt
45041   cycles for 1000 * fpatan
212272  cycles for 1000 * WithFld1
212404  cycles for 1000 * WithoutFld1

0       cycles for 1000 * fld1
38      cycles for 1000 * fld one real
143     cycles for 1000 * fld one int
??      cycles for 1000 * fldpi
29729   cycles for 1000 * fsqrt
44907   cycles for 1000 * fpatan
212263  cycles for 1000 * WithFld1
212198  cycles for 1000 * WithoutFld1

4       bytes for fld1
8       bytes for fld one real
8       bytes for fld one int
4       bytes for fldpi
6       bytes for fsqrt
6       bytes for fpatan
24      bytes for WithFld1
26      bytes for WithoutFld1

--- ok ---


With these numbers, freeing one FPU register costs about 0.03% of the time. Cheap, I think :biggrin:.

Thanks. HSE

FORTRANS

Hi,

   Some results from the oldie, moldy CPU collection.  Somewhat
weird results?

P-III
pre-P4 (SSE1)

2 cycles for 100 * fld1
103 cycles for 100 * fld one real
193 cycles for 100 * fld one int
42 cycles for 100 * fldpi
6861 cycles for 100 * fsqrt
10301 cycles for 100 * fpatan

2 cycles for 100 * fld1
103 cycles for 100 * fld one real
192 cycles for 100 * fld one int
42 cycles for 100 * fldpi
6849 cycles for 100 * fsqrt
10300 cycles for 100 * fpatan

1 cycles for 100 * fld1
103 cycles for 100 * fld one real
192 cycles for 100 * fld one int
41 cycles for 100 * fldpi
6849 cycles for 100 * fsqrt
10297 cycles for 100 * fpatan

4 bytes for fld1
8 bytes for fld one real
8 bytes for fld one int
4 bytes for fldpi
6 bytes for fsqrt
6 bytes for fpatan


--- ok ---

P-MMX
pre-P4
303 cycles for 100 * fld1
201 cycles for 100 * fld one real
403 cycles for 100 * fld one int
911 cycles for 100 * fldpi
8012 cycles for 100 * fsqrt
7287 cycles for 100 * fpatan

303 cycles for 100 * fld1
202 cycles for 100 * fld one real
405 cycles for 100 * fld one int
914 cycles for 100 * fldpi
8044 cycles for 100 * fsqrt
7283 cycles for 100 * fpatan

305 cycles for 100 * fld1
200 cycles for 100 * fld one real
402 cycles for 100 * fld one int
915 cycles for 100 * fldpi
7994 cycles for 100 * fsqrt
7285 cycles for 100 * fpatan

4 bytes for fld1
8 bytes for fld one real
8 bytes for fld one int
4 bytes for fldpi
6 bytes for fsqrt
6 bytes for fpatan


--- ok ---

Intel(R) Pentium(R) M processor 1.70GHz (SSE2)

206 cycles for 100 * fld1
99 cycles for 100 * fld one real
280 cycles for 100 * fld one int
242 cycles for 100 * fldpi
7307 cycles for 100 * fsqrt
14883 cycles for 100 * fpatan

204 cycles for 100 * fld1
101 cycles for 100 * fld one real
271 cycles for 100 * fld one int
237 cycles for 100 * fldpi
6864 cycles for 100 * fsqrt
14878 cycles for 100 * fpatan

205 cycles for 100 * fld1
98 cycles for 100 * fld one real
274 cycles for 100 * fld one int
247 cycles for 100 * fldpi
6860 cycles for 100 * fsqrt
14882 cycles for 100 * fpatan

4 bytes for fld1
8 bytes for fld one real
8 bytes for fld one int
4 bytes for fldpi
6 bytes for fsqrt
6 bytes for fpatan


--- ok ---


HTH,

Steve N.

jj2007

 :biggrin:

Yeah, the results look a bit weird. The cycle counts are the difference between the full loop and an empty loop. That doesn't work exactly the same way on all processors. On the positive side, the timings are usually quite stable :bgrin: