Hi guys,
I am trying to work out how to do a simple floating point subtraction but the result is incorrect.
fild valA
fild valB
fsub
fstp result
valA is a real4 of 1000.0 and valB is a real4 of 1.0
I was expecting a result of 999.0 but I get -8.34929E+07.
Any advice would be much appreciated 8)
Ah! Worked it out. I should have been using fld not fild :eusa_clap:
For simple math and sqrt and rsqrt, use SSE instead; it's easier to make code faster in a loop
movss xmm0,val1
subss xmm0,val2
movss result,xmm0
And when you need a speedup, use movaps, subps etc. instead
Awesome! Thanks for the tip 8)
Quote from: daydreamer on April 16, 2018, 08:58:39 PM
For simple math and sqrt and rsqrt, use SSE instead; it's easier to make code faster in a loop
movss xmm0,val1
subss xmm0,val2
movss result,xmm0
And if you are not absolutely sure, time it:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
87 cycles for 100 * movss
59 cycles for 100 * fpu
80 cycles for 100 * movss
59 cycles for 100 * fpu
80 cycles for 100 * movss
59 cycles for 100 * fpu
84 cycles for 100 * movss
59 cycles for 100 * fpu
80 cycles for 100 * movss
60 cycles for 100 * fpu
24 bytes for movss
18 bytes for fpu
movss:
movss xmm0, val1
subss xmm0, val2
movss result, xmm0
fpu:
fld val1
fsub val2
fstp result
Yep. Been timing routines heaps today.
Trying to make them as tight as possible as they are performance critical for my project.
Lots of testing, timing, and learning going on :biggrin:
Having another small issue.
I am trying to calculate the tangent of a value using the fp* commands.
fld VALUE ;// contains 0.523599
fptan
fstp RESULT ;// should be 0.57735
According to ConverterDD (http://masm32.com/board/index.php?topic=1819.0) the value in RESULT is NaN.
ConverterDD has been displaying all results correctly so far.
Am I using fptan correctly?
Quote from: Lonewolff on April 16, 2018, 10:22:06 PM
Am I using fptan correctly?
What does the help file say?
include \masm32\MasmBasic\MasmBasic.inc ; download (http://masm32.com/board/index.php?topic=94.0)
Init
fld FP4(0.523599)
fptan
deb 4, "The FPU:", ST(0), ST(1)
EndOfCode
The FPU:
ST(0) 1.000000000000000000
ST(1) 0.5773506065083982818
Quote from: Lonewolff on April 16, 2018, 10:22:06 PM
Having another small issue.
I am trying to calculate the tangent of a value using the fp* commands.
Quote
fld VALUE ;// contains 0.523599
fptan
fstp st ; remove 1.0 from st(0)<<-- See Simply FPU by Raymond
fstp RESULT ;// should be 0.57735
According to ConverterDD (http://masm32.com/board/index.php?topic=1819.0 (http://masm32.com/board/index.php?topic=1819.0)) the value in RESULT is NaN.
ConverterDD has been displaying all results correctly so far.
Am I using fptan correctly? <<<--- NO, it seems you want tan(VALUE)=0.5773505683919327
Hi
Could you post this simple example (the asm file) ?
Quote from: jj2007 on April 16, 2018, 10:43:20 PM
What does the help file say?
Help file says I am. The result says otherwise.
@RuiLoureiro - Will do :t
This works though.
fld number ;// contains 0.523599
fptan
fstp result ;// contains 1 why?
fstp result ;// should be 0.57735
Why is it that two fstp calls are required?
Quote
Why is it that two fstp calls are required?
If you follow up on reading the recommended FPU tutorial (more specifically the part relating to the fptan instruction at http://www.ray.masmcode.com/tutorial/fpuchap10.htm#fptan) you will get the answer to your questions, including the last.
It may also give you a hint to explain one of your previous comments:
Quote
According to ConverterDD (http://masm32.com/board/index.php?topic=1819.0) the value in RESULT is NaN.
That may possibly be due to valid data already being in the ST(7) and/or ST(0) registers when attempting to compute the tangent. Otherwise a value of "1" should have been returned by ConverterDD.
Thanks for the link to the tutorial. Clears it up well :t
Quote from: Lonewolff on April 17, 2018, 10:12:49 AM
This works though.
Quote
fld number ;// contains 0.523599
fptan
fstp resultX ;// contains 1 why? <<<- because fptan gives 2 results, not 1
fstp result ;// should be 0.57735
Why is it that two fstp calls are required?
What do you get for the new resultX variable? Try to see. Print it. It should be 1.0 as Raymond said.
fstp st should be used in this case, it is faster. It removes the current st(0) when we don't need it.
:t
Quote from: RuiLoureiro on April 17, 2018, 10:41:22 PM
fstp st should be used, it is faster. It removes the current st(0) when we don't need it.
Btw there is also fincstp, which at first sight has the same effect. But try a simple fldpi afterwards, and you'll see the difference. Olly has a section with the FPU regs; you must scroll down a little in the upper right pane to see it.
Quote from: jj2007 on April 17, 2018, 10:55:42 PM
Quote from: RuiLoureiro on April 17, 2018, 10:41:22 PM
fstp st should be used, it is faster. It removes the current st(0) when we don't need it.
Btw there is also fincstp, which at first sight has the same effect. But try a simple fldpi afterwards, and you'll see the difference. Olly has a section with the FPU regs; you must scroll down a little in the upper right pane to see it.
Ok Jochen. I don't know where I have Olly on this computer. I rarely use it. LoneWolff may do this to learn.
What I want to say to LoneWolff is this:
generally, we use fstp st to remove st(0) when we don't need it.
Yeah, the fstp resultX call was just to get rid of the value. I'll swap out the command now that I know about it.
Thanks guys! Awesome information as always 8)
Quote from: Lonewolff on April 18, 2018, 07:47:06 AM
Yeah, the fstp resultX call was just to get rid of the value. I'll swap out the command now that that I know about it.
Thanks guys! Awesome information as always 8)
Good luck :t
Quote from: RuiLoureiro on April 18, 2018, 08:03:43 AM
Good luck :t
Thanks Rui :biggrin:
There was a measurable performance increase with the fstp st method.
fld number
fptan
fstp st
fstp result
But fincstp gave me a NaN result.
fld number
fptan
fincstp
fstp result
Try this one,
ffree st(0)
fld number
fptan
ffree st(0)
fstp result
That one gives me a NaN result also :(
Was just reading about the fincstp method.
http://qcd.phys.cmu.edu/QCDcluster/intel/vtune/reference/vc98.htm
This operation is not equivalent to popping the stack, because the tag for the previous top-of-stack register is not marked empty.
Quote from: Lonewolff on April 18, 2018, 08:26:21 AM
fld number
fptan
ffree st(0)
fstp result
That one gives me a NaN result also :(
Let's see if you can use your skills to find out in the tutorial why you are getting that result. ;)
Quote from: raymond on April 18, 2018, 10:18:38 AM
Quote from: Lonewolff on April 18, 2018, 08:26:21 AM
fld number
fptan
ffree st(0)
fstp result
That one gives me a NaN result also :(
Let's see if you can use your skills to find out in the tutorial why you are getting that result. ;)
Ha! Got me again.
It's that double pop thing again. Had only just woken up when I tried it out. :t
Quote from: Siekmanski on April 18, 2018, 08:22:52 AM
Try this one,
ffree st(0)
That instruction is about as rare as fincstp, and for the same reason :P
Quote from: Lonewolff on April 18, 2018, 08:34:38 AM
This operation is not equivalent to popping the stack, because the tag for the previous top-of-stack register is not marked empty.
Exactly :t
The point here is: fincstp rotates ST(0) into ST(7). If ST(7) is not empty, i.e. it carries a value, then any attempt to load ST(0) will fail.
In practice, you will never see ffree ST(0) or fincstp, but you will often see
fstp st ; pop ST, clear ST(7)
ffree ST(7) ; clear ST(7)
fxch ; to get ST(1), exchange ST(0), ST(1)
Quote from: jj2007 on April 18, 2018, 11:21:32 AM
Quote from: Siekmanski on April 18, 2018, 08:22:52 AM
Try this one,
ffree st(0)
That instruction is about as rare as fincstp, and for the same reason :P
I found that using that instruction, I had to fincstp anyway to get the right result, which negated any performance gains.
Seems that fstp st has given the best performance so far.
Hi LoneWolff,
fstp st is the same as fstp st(0).
But there are cases where we want to remove st(1) ...
In these cases we use fstp st(1) ...
In some cases, after one FPU instruction, we add this code to detect an error:
fstsw ax ; store Status Word register to AX register
fwait
shr ax, 1 ; move bit 0 to carry flag
jc _iserror
; go on, no error here
...
; exit here without error
_iserror: fstp st ; remove st(0)
fclex ; clear all bits in the status word register -> new instruction, new error ?
; exit here with an error message
Hey guys,
I thought I'd ask this in this topic rather than create a new one, as it is still a FP question.
Is there a more efficient way of writing this? In regards to the final 'fstp' call.
This code works fine, just wondering if it is optimal.
fld fP ; farPlane / (farPlane - nearPlane)
fld nP
fsub
fld fP
fdiv st(0),st(1)
fstp _res
mov ecx, _res
mov [eax+40], ecx ; Store result in _33
fstp _res ; Clear the last value from the FP stack
If I don't call the final line, a value gets left on the FP stack. I could call 'finit' but that is a very slow call.
Just wondering if I am going about this the right way.
Thanks again 8)
[edit]
Actually, thinking about it, I probably just need to play with the order of operation.
[edit2]
Yep worked well. And one line less.
fld fP ; farPlane / (farPlane - nearPlane)
fld fP
fld nP
fsub
fdiv
fstp _res
mov ecx, _res
mov [eax+40], ecx ; Store result in _33
fld fp ; farPlane / (farPlane - nearPlane)
fld st
fld rp
fsubp
fdivp
fstp real4 ptr [eax+40]
:biggrin:
Nice!
Even better :eusa_clap:
[edit]
Hang on. It says invalid operands on fsubp and fdivp. :P
Were those two lines a typo?
No, it works.
fsub with a register pop
fdiv with a register pop
edit: I wrote rp instead of np, the rest is ok.
fld farPlane ; farPlane / (farPlane - nearPlane)
fld st(0)
fld nearPlane
fsubp
fdivp
fstp real4 ptr [eax+40]
Weird :icon_confused:
Doesn't compile for me with the 'p' suffix.
Quote
(22) : error A2070: invalid instruction operands
(23) : error A2070: invalid instruction operands
fld fP ; farPlane / (farPlane - nearPlane)
fld fP
fld nP
fsubp ; line 22
fdivp ; line 23
fstp real4 ptr [eax+40] ; Store result in _33
Strange, it's a valid instruction.
https://www.coursehero.com/file/p42643a0/The-FSUBP-instructions-perform-the-additional-operation-of-popping-the-FPU/
https://www.coursehero.com/file/p42643a0/The-FDIVP-instructions-perform-the-additional-operation-of-popping-the-FPU/
I would have coded it as follows:
fld fp ;fp
fld st ;fp fp ;make copy in st(0)
fsub np ;(fp-np) fp ;subtract np from fp in st(0)
fdiv ;fp/(fp-np) ;divide fp in st(1) by the content of st(0) and pop st(0)
;st(0) now contains the result of the division
fstp real4 ptr [eax+40] ;store result and clean fpu
Check the 'Chapter 8 - Arithmetic instructions - with REAL numbers' at http://www.ray.masmcode.com/tutorial/fpuchap8.htm
Thanks for the 'fsub' tip. :t
I am finding that this...
fld fP
fld fP
...is benchmarking faster than this though.
fld fP
fld st
Quote from: Lonewolff on April 24, 2018, 01:17:57 PM
Thanks for the 'fsub' tip. :t
I am finding that this...
fld fP
fld fP
...is benchmarking faster than this though.
fld fP
fld st
Truly surprising. Fetching data from memory, even from cache, is usually considered slower than register-to-register operations.
Quote from: raymond on April 24, 2018, 01:24:30 PM
Truly surprising. Fetching data from memory, even from cache, is usually considered slower than register-to-register operations.
Yeah I am looping the routine 1000000000 times and timing how long to complete (the whole code snippet that is).
fld ST version 6578 ms
fld fP version 6422 ms
No matter how many times I run it the fP version is quicker on every occasion.
Wow, the CPU really is a quirky beast.
I have another routine that uses the 'fchs' opcode and I can speed up (or slow down) the tests by 200 ms, just by changing when 'fchs' is called. :dazzled:
I have run it over and over and the results consistently differ by 200 ms, just depending on the placement of 'fchs'.
Quote from: Lonewolff on April 24, 2018, 10:25:34 AM
Weird :icon_confused:
Doesn't compile for me with the 'p' suffix.
fsubp ; line 22
fdivp ; line 23
You may use:
- fsub (no p, but does the same)
- fsubp st(1), st
- an assembler higher than ML 6.15 (8.0+, UAsm, AsmC, JWasm)
Quote from: Lonewolff on April 24, 2018, 01:28:16 PM
Quote from: raymond on April 24, 2018, 01:24:30 PM
Truly surprising. Fetching data from memory, even from cache, is usually considered slower than register-to-register operations.
Yeah I am looping the routine 1000000000 times and timing how long to complete (the whole code snippet that is).
fld ST version 6578 ms
fld fP version 6422 ms
No matter how many times I run it the fP version is quicker on every occasion.
It seems that you are testing the code on your current CPU.
Is it faster on another CPU ?
Any load goes to st(0). So it seems more logical to copy the current st(0) into a new st(0) - fld st(0) - than to load the same memory value into a new st(0) again. So I prefer the fld ST version.
It's tested on a relatively recent AMD. I have an i7 here as well, so I will test on that also to compare results.
Will let you know :t
Weird ::)
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
170 cycles for 100 * fld Real4 mem, mem
269 cycles for 100 * fld Real4 mem, st
169 cycles for 100 * fld Real8 mem, mem
270 cycles for 100 * fld Real8 mem, st
372 cycles for 100 * fld Real10 mem, mem
372 cycles for 100 * fld Real10 mem, st
169 cycles for 100 * fld Real4 mem, mem
267 cycles for 100 * fld Real4 mem, st
169 cycles for 100 * fld Real8 mem, mem
267 cycles for 100 * fld Real8 mem, st
372 cycles for 100 * fld Real10 mem, mem
374 cycles for 100 * fld Real10 mem, st
169 cycles for 100 * fld Real4 mem, mem
269 cycles for 100 * fld Real4 mem, st
169 cycles for 100 * fld Real8 mem, mem
267 cycles for 100 * fld Real8 mem, st
373 cycles for 100 * fld Real10 mem, mem
373 cycles for 100 * fld Real10 mem, st
16 bytes for fld Real4 mem, mem
12 bytes for fld Real4 mem, st
16 bytes for fld Real8 mem, mem
12 bytes for fld Real8 mem, st
16 bytes for fld Real10 mem, mem
12 bytes for fld Real10 mem, st
Intel(R) Celeron(R) CPU N2840 @ 2.16GHz (SSE4)
166 cycles for 100 * fld Real4 mem, mem
165 cycles for 100 * fld Real4 mem, st
175 cycles for 100 * fld Real8 mem, mem
168 cycles for 100 * fld Real8 mem, st
1029 cycles for 100 * fld Real10 mem, mem
596 cycles for 100 * fld Real10 mem, st
163 cycles for 100 * fld Real4 mem, mem
163 cycles for 100 * fld Real4 mem, st
163 cycles for 100 * fld Real8 mem, mem
168 cycles for 100 * fld Real8 mem, st
1041 cycles for 100 * fld Real10 mem, mem
602 cycles for 100 * fld Real10 mem, st
170 cycles for 100 * fld Real4 mem, mem
164 cycles for 100 * fld Real4 mem, st
174 cycles for 100 * fld Real8 mem, mem
169 cycles for 100 * fld Real8 mem, st
1056 cycles for 100 * fld Real10 mem, mem
611 cycles for 100 * fld Real10 mem, st
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)
168 cycles for 100 * fld Real4 mem, mem
266 cycles for 100 * fld Real4 mem, st
168 cycles for 100 * fld Real8 mem, mem
272 cycles for 100 * fld Real8 mem, st
373 cycles for 100 * fld Real10 mem, mem
373 cycles for 100 * fld Real10 mem, st
167 cycles for 100 * fld Real4 mem, mem
268 cycles for 100 * fld Real4 mem, st
168 cycles for 100 * fld Real8 mem, mem
268 cycles for 100 * fld Real8 mem, st
373 cycles for 100 * fld Real10 mem, mem
372 cycles for 100 * fld Real10 mem, st
167 cycles for 100 * fld Real4 mem, mem
266 cycles for 100 * fld Real4 mem, st
168 cycles for 100 * fld Real8 mem, mem
267 cycles for 100 * fld Real8 mem, st
373 cycles for 100 * fld Real10 mem, mem
373 cycles for 100 * fld Real10 mem, st
16 bytes for fld Real4 mem, mem
12 bytes for fld Real4 mem, st
16 bytes for fld Real8 mem, mem
12 bytes for fld Real8 mem, st
16 bytes for fld Real10 mem, mem
12 bytes for fld Real10 mem, st
--- ok ---
AMD A6-3500 APU with Radeon(tm) HD Graphics (SSE3)
12 cycles for 100 * fld Real4 mem, mem
11 cycles for 100 * fld Real4 mem, st
12 cycles for 100 * fld Real8 mem, mem
11 cycles for 100 * fld Real8 mem, st
612 cycles for 100 * fld Real10 mem, mem
318 cycles for 100 * fld Real10 mem, st
12 cycles for 100 * fld Real4 mem, mem
10 cycles for 100 * fld Real4 mem, st
11 cycles for 100 * fld Real8 mem, mem
11 cycles for 100 * fld Real8 mem, st
612 cycles for 100 * fld Real10 mem, mem
317 cycles for 100 * fld Real10 mem, st
13 cycles for 100 * fld Real4 mem, mem
9 cycles for 100 * fld Real4 mem, st
12 cycles for 100 * fld Real8 mem, mem
10 cycles for 100 * fld Real8 mem, st
612 cycles for 100 * fld Real10 mem, mem
319 cycles for 100 * fld Real10 mem, st
16 bytes for fld Real4 mem, mem
12 bytes for fld Real4 mem, st
16 bytes for fld Real8 mem, mem
12 bytes for fld Real8 mem, st
16 bytes for fld Real10 mem, mem
12 bytes for fld Real10 mem, st
--- ok ---
What happened here? Why are the cycles so much slower than on the other machines?
Quote from: HSE on April 25, 2018, 10:17:25 AM
What happened here? Why are the cycles so much slower than on the other machines?
No idea how this can run in 11 cycles (not including the loop overhead, though) ::)
mov ebx, 99 ; loop 100x
align 4
.Repeat
fld MyR4
fld MyR4
fstp st
fstp st
dec ebx
.Until Sign?
So I stumbled across something hey? :lol:
Heaps faster all round for real4's to do two fld's. :t
Maybe a caching thing in the CPU itself? Knows it already has the value there so just re-uses it perhaps?
Quote from: jj2007 on April 25, 2018, 04:45:39 PM
Quote from: HSE on April 25, 2018, 10:17:25 AM
What happened here? Why are the cycles so much slower than on the other machines?
No idea how this can run in 11 cycles (not including the loop overhead, though) ::)
Quote
mov ebx, 99 ; loop 100x <<<< THIS IS FOR real4, mem, mem ?
align 4
.Repeat
fld MyR4
fld MyR4 <<<<<<<<<<
fstp st
fstp st
dec ebx
.Until Sign?
-----------------------------
mov ebx, 99 <<<<< this is for real4, mem, st ?
align 4
.Repeat
fld MyR4
fld st <<<<<<<<<<<<<<<<
fstp st
fstp st
dec ebx
.Until Sign?
Good work Jochen :t
Is this what you are doing, correct?
Now I want to say that I never used REAL4 or REAL8, only REAL10.
Quote
Intel(R) Atom(TM) CPU N455 @ 1.66GHz (SSE4)
624 cycles for 100 * fld Real4 mem, mem
596 cycles for 100 * fld Real4 mem, st
603 cycles for 100 * fld Real8 mem, mem
595 cycles for 100 * fld Real8 mem, st
1426 cycles for 100 * fld Real10 mem, mem
1015 cycles for 100 * fld Real10 mem, st
604 cycles for 100 * fld Real4 mem, mem
595 cycles for 100 * fld Real4 mem, st
600 cycles for 100 * fld Real8 mem, mem
597 cycles for 100 * fld Real8 mem, st
1427 cycles for 100 * fld Real10 mem, mem
1002 cycles for 100 * fld Real10 mem, st
613 cycles for 100 * fld Real4 mem, mem
595 cycles for 100 * fld Real4 mem, st
615 cycles for 100 * fld Real8 mem, mem
599 cycles for 100 * fld Real8 mem, st
1416 cycles for 100 * fld Real10 mem, mem
1355 cycles for 100 * fld Real10 mem, st
16 bytes for fld Real4 mem, mem
12 bytes for fld Real4 mem, st
16 bytes for fld Real8 mem, mem
12 bytes for fld Real8 mem, st
16 bytes for fld Real10 mem, mem
12 bytes for fld Real10 mem, st
-
F:\TEMP\TEST>fld_mem_
Intel(R) Pentium(R) M processor 1.70GHz (SSE2)
293 cycles for 100 * fld Real4 mem, mem
192 cycles for 100 * fld Real4 mem, st
298 cycles for 100 * fld Real8 mem, mem
186 cycles for 100 * fld Real8 mem, st
508 cycles for 100 * fld Real10 mem, mem
423 cycles for 100 * fld Real10 mem, st
290 cycles for 100 * fld Real4 mem, mem
193 cycles for 100 * fld Real4 mem, st
292 cycles for 100 * fld Real8 mem, mem
192 cycles for 100 * fld Real8 mem, st
508 cycles for 100 * fld Real10 mem, mem
423 cycles for 100 * fld Real10 mem, st
289 cycles for 100 * fld Real4 mem, mem
198 cycles for 100 * fld Real4 mem, st
296 cycles for 100 * fld Real8 mem, mem
193 cycles for 100 * fld Real8 mem, st
512 cycles for 100 * fld Real10 mem, mem
425 cycles for 100 * fld Real10 mem, st
16 bytes for fld Real4 mem, mem
12 bytes for fld Real4 mem, st
16 bytes for fld Real8 mem, mem
12 bytes for fld Real8 mem, st
16 bytes for fld Real10 mem, mem
12 bytes for fld Real10 mem, st
--- ok ---
Interesting results ::)
- an AMD that does impossible things
- recent Intel i5/i7 where mem access is faster
- mobile cpus like Celeron and Pentium M where fld st is faster.
What does that teach us? Nothing 8)
Making fast routines isn't easy in modern times. :(
We need a database for optimizations per system. :bgrin:
> What does that teach us?
Balanced algorithms with testing spread across mixed hardware.
Quote from: Siekmanski on April 26, 2018, 12:34:23 AM
Making fast routines isn't easy in modern times. :(
We need a database for optimizations per system. :bgrin:
Make similar code in MASM, like Java's Just-In-Time (JIT) compiler:
first run and time the first code snippet, then run the second version of the snippet, and so on; finally compare which runs fastest, keep the code so it only jumps to that version, and set a flag that signals that the testing for the fastest version is over.
Nah, I prefer Hutch's approach.
Quote from: Siekmanski on April 26, 2018, 07:06:26 AM
Nah, I prefer Hutch's approach.
Me too :t
On your i7-4930K the results are, more or less, 373 cycles for real10 in both cases...
Here is another:
Quote
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
193 cycles for 100 * fld Real4 mem, mem
174 cycles for 100 * fld Real4 mem, st
178 cycles for 100 * fld Real8 mem, mem
162 cycles for 100 * fld Real8 mem, st
1644 cycles for 100 * fld Real10 mem, mem
838 cycles for 100 * fld Real10 mem, st
------------------------
177 cycles for 100 * fld Real4 mem, mem
158 cycles for 100 * fld Real4 mem, st
177 cycles for 100 * fld Real8 mem, mem
174 cycles for 100 * fld Real8 mem, st
1644 cycles for 100 * fld Real10 mem, mem
847 cycles for 100 * fld Real10 mem, st
------------------------
193 cycles for 100 * fld Real4 mem, mem
175 cycles for 100 * fld Real4 mem, st
179 cycles for 100 * fld Real8 mem, mem
159 cycles for 100 * fld Real8 mem, st
1649 cycles for 100 * fld Real10 mem, mem
838 cycles for 100 * fld Real10 mem, st
------------------------
16 bytes for fld Real4 mem, mem
12 bytes for fld Real4 mem, st
16 bytes for fld Real8 mem, mem
12 bytes for fld Real8 mem, st
16 bytes for fld Real10 mem, mem
12 bytes for fld Real10 mem, st
--- ok ---
Hi Jochen,
Quote from: jj2007 on April 25, 2018, 11:49:08 PM
Interesting results ::)
- an AMD that does impossible things
- recent Intel i5/i7 where mem access is faster
- mobile cpus like Celeron and Pentium M where fld st is faster.
What does that teach us? Nothing 8)
Quote from: jj2007 on April 25, 2018, 04:45:39 PM
Quote from: HSE on April 25, 2018, 10:17:25 AM
What happened here? Why are the cycles so much slower than on the other machines?
No idea how this can run in 11 cycles (not including the loop overhead, though) ::)
mov ebx, 99 ; loop 100x
align 4
.Repeat
fld MyR4
fld MyR4
fstp st
fstp st
dec ebx
.Until Sign?
An idea occurred to me.* Loading MyR4 twice may be cached
by the FPU? Would making a 200 element array, and loading a
different value {say MyR4[X],MyR4[X+400]} change the timing
any? Sorry if this is a silly question, the AMD results just looked
too odd.
Regards,
Steve N.
* Yes, it does happen. Of course that doesn't imply it is a good one.
AMD A6-3500 APU with Radeon(tm) HD Graphics (SSE3)
11 cycles for 100 * fld Real4 mem, mem
7 cycles for 100 * fld Real4 mem, st
50 cycles for 100 * fld Real8 mem, mem
9 cycles for 100 * fld Real8 mem, st
600 cycles for 100 * fld Real10 mem, mem
309 cycles for 100 * fld Real10 mem, st
10 cycles for 100 * fld Real4 mem, mem
7 cycles for 100 * fld Real4 mem, st
50 cycles for 100 * fld Real8 mem, mem
7 cycles for 100 * fld Real8 mem, st
602 cycles for 100 * fld Real10 mem, mem
309 cycles for 100 * fld Real10 mem, st
9 cycles for 100 * fld Real4 mem, mem
7 cycles for 100 * fld Real4 mem, st
52 cycles for 100 * fld Real8 mem, mem
7 cycles for 100 * fld Real8 mem, st
600 cycles for 100 * fld Real10 mem, mem
309 cycles for 100 * fld Real10 mem, st
16 bytes for fld Real4 mem, mem
12 bytes for fld Real4 mem, st
16 bytes for fld Real8 mem, mem
12 bytes for fld Real8 mem, st
16 bytes for fld Real10 mem, mem
12 bytes for fld Real10 mem, st
--- ok ---
Using myR4 and myR4b, myR8 and myR8b, and myR10 and myR10b.
Note that R8 is slower now, but no different with R4.
Quote from: FORTRANS on April 26, 2018, 11:05:12 PM
An idea occurred to me.* Loading MyR4 twice may be cached
by the FPU? Would making a 200 element array, and loading a
different value {say MyR4[X],MyR4[X+400]} change the timing
any? Sorry if this is a silly question, the AMD results just looked
too odd.
Steve,
That is definitely not a silly question, so I tested it with randomly filled arrays:
mov ebx, AlgoLoops-1
align 4
.Repeat
fld MyR4[4*ebx]
fld MyR4[4*ebx]
fstp st
fstp st
dec ebx
.Until Sign?
Results with AlgoLoops=10,000:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
20248 cycles for 10000 * fld Real4 mem, mem
30267 cycles for 10000 * fld Real4 mem, st
20704 cycles for 10000 * fld Real8 mem, mem
30563 cycles for 10000 * fld Real8 mem, st
40031 cycles for 10000 * fld Real10 mem, mem
40005 cycles for 10000 * fld Real10 mem, st
20247 cycles for 10000 * fld Real4 mem, mem
30314 cycles for 10000 * fld Real4 mem, st
20708 cycles for 10000 * fld Real8 mem, mem
30512 cycles for 10000 * fld Real8 mem, st
40018 cycles for 10000 * fld Real10 mem, mem
40026 cycles for 10000 * fld Real10 mem, st
20248 cycles for 10000 * fld Real4 mem, mem
30272 cycles for 10000 * fld Real4 mem, st
20755 cycles for 10000 * fld Real8 mem, mem
30492 cycles for 10000 * fld Real8 mem, st
40013 cycles for 10000 * fld Real10 mem, mem
40015 cycles for 10000 * fld Real10 mem, st
In short: My i5 couldn't care less :P
Now the Celeron:
Intel(R) Celeron(R) CPU N2840 @ 2.16GHz (SSE4)
16354 cycles for 10000 * fld Real4 mem, mem
15524 cycles for 10000 * fld Real4 mem, st
18059 cycles for 10000 * fld Real8 mem, mem
17376 cycles for 10000 * fld Real8 mem, st
98249 cycles for 10000 * fld Real10 mem, mem
57020 cycles for 10000 * fld Real10 mem, st
16382 cycles for 10000 * fld Real4 mem, mem
17095 cycles for 10000 * fld Real4 mem, st
18286 cycles for 10000 * fld Real8 mem, mem
17329 cycles for 10000 * fld Real8 mem, st
99281 cycles for 10000 * fld Real10 mem, mem
56412 cycles for 10000 * fld Real10 mem, st
15761 cycles for 10000 * fld Real4 mem, mem
16297 cycles for 10000 * fld Real4 mem, st
17914 cycles for 10000 * fld Real8 mem, mem
18403 cycles for 10000 * fld Real8 mem, st
98288 cycles for 10000 * fld Real10 mem, mem
56802 cycles for 10000 * fld Real10 mem, st
AMD A6-3500 APU with Radeon(tm) HD Graphics (SSE3)
1811 cycles for 10000 * fld Real4 mem, mem
1619 cycles for 10000 * fld Real4 mem, st
3760 cycles for 10000 * fld Real8 mem, mem
?? cycles for 10000 * fld Real8 mem, st
55645 cycles for 10000 * fld Real10 mem, mem
32089 cycles for 10000 * fld Real10 mem, st
1313 cycles for 10000 * fld Real4 mem, mem
?? cycles for 10000 * fld Real4 mem, st
5816 cycles for 10000 * fld Real8 mem, mem
1112 cycles for 10000 * fld Real8 mem, st
61808 cycles for 10000 * fld Real10 mem, mem
31172 cycles for 10000 * fld Real10 mem, st
?? cycles for 10000 * fld Real4 mem, mem
?? cycles for 10000 * fld Real4 mem, st
3241 cycles for 10000 * fld Real8 mem, mem
?? cycles for 10000 * fld Real8 mem, st
61940 cycles for 10000 * fld Real10 mem, mem
27024 cycles for 10000 * fld Real10 mem, st
18 bytes for fld Real4 mem, mem
13 bytes for fld Real4 mem, st
18 bytes for fld Real8 mem, mem
13 bytes for fld Real8 mem, st
16 bytes for fld Real10 mem, mem
12 bytes for fld Real10 mem, st
--- ok ---
8)
now with array in .data?
AMD A6-3500 APU with Radeon(tm) HD Graphics (SSE3)
289 cycles for 10000 * fld Real4 mem, mem
1526 cycles for 10000 * fld Real4 mem, st
1665 cycles for 10000 * fld Real8 mem, mem
2670 cycles for 10000 * fld Real8 mem, st
51864 cycles for 10000 * fld Real10 mem, mem
25661 cycles for 10000 * fld Real10 mem, st
270 cycles for 10000 * fld Real4 mem, mem
1552 cycles for 10000 * fld Real4 mem, st
726 cycles for 10000 * fld Real8 mem, mem
2681 cycles for 10000 * fld Real8 mem, st
61508 cycles for 10000 * fld Real10 mem, mem
25214 cycles for 10000 * fld Real10 mem, st
1824 cycles for 10000 * fld Real4 mem, mem
?? cycles for 10000 * fld Real4 mem, st
177 cycles for 10000 * fld Real8 mem, mem
3478 cycles for 10000 * fld Real8 mem, st
51708 cycles for 10000 * fld Real10 mem, mem
31514 cycles for 10000 * fld Real10 mem, st
18 bytes for fld Real4 mem, mem
13 bytes for fld Real4 mem, st
18 bytes for fld Real8 mem, mem
13 bytes for fld Real8 mem, st
16 bytes for fld Real10 mem, mem
12 bytes for fld Real10 mem, st
--- ok ---
fld RealXX mem, mem faster than fld RealXX mem, st ??
?? cycles ... what does that mean?
Quote from: RuiLoureiro on April 27, 2018, 12:04:37 AM
?? cycles ... ? what it means ?
It means the timing procedure could not establish a valid value. There is something strange with the AMD:
289 cycles for 10000 * fld Real4 mem, mem
This is simply impossible. My template uses cpuid + rdtsc from Michael Webster's timer macros (http://masm32.com/board/index.php?topic=49.0); no idea what could happen there:
counter_begin TimerLoops, HIGH_PRIORITY_CLASS
call TestA
counter_end
ShowCycles TestA
Could it be that the loop overhead is wrong, or that the spinup loop is too short? Attached a special version for HSE that shows the overhead, and uses a much longer spinup loop (20x) once:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
loop overhead is approx. 10067/10000 cycles
20191 cycles for 10000 * fld Real4 mem, mem
30011 cycles for 10000 * fld Real4 mem, st
20753 cycles for 10000 * fld Real8 mem, mem
30031 cycles for 10000 * fld Real8 mem, st
39976 cycles for 10000 * fld Real10 mem, mem
40001 cycles for 10000 * fld Real10 mem, st
Hi,
Thanks for testing.
Quote from: jj2007 on April 26, 2018, 11:34:14 PM
Steve,
That is definitely not a silly question, so I tested it with randomly filled arrays:
fld MyR4[4*ebx]
fld MyR4[4*ebx]
But again you load the same value twice in a row. That's why I
put in a +400 for the second load. But HSE's results show that
probably would not change things on his AMD.
Thanks,
Steve
AMD Phenom(tm) II X6 1045T Processor (SSE3)
Spinup done
loop overhead is approx. 20057/10000 cycles
879 cycles for 10000 * fld Real4 mem, mem
6518 cycles for 10000 * fld Real4 mem, st
3628 cycles for 10000 * fld Real8 mem, mem
4023 cycles for 10000 * fld Real8 mem, st
60146 cycles for 10000 * fld Real10 mem, mem
47240 cycles for 10000 * fld Real10 mem, st
841 cycles for 10000 * fld Real4 mem, mem
1091 cycles for 10000 * fld Real4 mem, st
3911 cycles for 10000 * fld Real8 mem, mem
5221 cycles for 10000 * fld Real8 mem, st
60101 cycles for 10000 * fld Real10 mem, mem
43169 cycles for 10000 * fld Real10 mem, st
695 cycles for 10000 * fld Real4 mem, mem
870 cycles for 10000 * fld Real4 mem, st
4738 cycles for 10000 * fld Real8 mem, mem
4917 cycles for 10000 * fld Real8 mem, st
60153 cycles for 10000 * fld Real10 mem, mem
35096 cycles for 10000 * fld Real10 mem, st
18 bytes for fld Real4 mem, mem
13 bytes for fld Real4 mem, st
18 bytes for fld Real8 mem, mem
13 bytes for fld Real8 mem, st
16 bytes for fld Real10 mem, mem
12 bytes for fld Real10 mem, st
-
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)
Spinup done
loop overhead is approx. 10454/10000 cycles
20329 cycles for 10000 * fld Real4 mem, mem
29593 cycles for 10000 * fld Real4 mem, st
21202 cycles for 10000 * fld Real8 mem, mem
29582 cycles for 10000 * fld Real8 mem, st
39593 cycles for 10000 * fld Real10 mem, mem
39595 cycles for 10000 * fld Real10 mem, st
19664 cycles for 10000 * fld Real4 mem, mem
29596 cycles for 10000 * fld Real4 mem, st
19650 cycles for 10000 * fld Real8 mem, mem
29598 cycles for 10000 * fld Real8 mem, st
39594 cycles for 10000 * fld Real10 mem, mem
39597 cycles for 10000 * fld Real10 mem, st
19614 cycles for 10000 * fld Real4 mem, mem
29630 cycles for 10000 * fld Real4 mem, st
19701 cycles for 10000 * fld Real8 mem, mem
29659 cycles for 10000 * fld Real8 mem, st
39591 cycles for 10000 * fld Real10 mem, mem
39590 cycles for 10000 * fld Real10 mem, st
18 bytes for fld Real4 mem, mem
13 bytes for fld Real4 mem, st
18 bytes for fld Real8 mem, mem
13 bytes for fld Real8 mem, st
16 bytes for fld Real10 mem, mem
12 bytes for fld Real10 mem, st
--- ok ---
Quote from: FORTRANS on April 27, 2018, 12:44:51 AMBut again you load the same value twice in a row.
OK, since you insist :P
fld MyR4[4*ebx]
fld MyR4[4*ebx+4*2000]
Note that Jim's AMD needs twice as much time for the naked loop; it could be that the AMD performs the loop bookkeeping and the FPU loads simultaneously. Agner (http://www.agner.org/optimize/blog/read.php?i=838) says the latest Ryzen is faster:
QuoteThis makes it possible to execute a tiny loop with up to six instructions in one clock cycle per iteration
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
Spinup done, 4,200,000,000*dec eax
loop overhead is approx. 10112/10000 cycles
20271 cycles for 10000 * fld Real4 mem, mem
30139 cycles for 10000 * fld Real4 mem, st
20586 cycles for 10000 * fld Real8 mem, mem
30194 cycles for 10000 * fld Real8 mem, st
42013 cycles for 10000 * fld Real10 mem, mem
40749 cycles for 10000 * fld Real10 mem, st
20330 cycles for 10000 * fld Real4 mem, mem
30232 cycles for 10000 * fld Real4 mem, st
20648 cycles for 10000 * fld Real8 mem, mem
30324 cycles for 10000 * fld Real8 mem, st
41906 cycles for 10000 * fld Real10 mem, mem
40546 cycles for 10000 * fld Real10 mem, st
20365 cycles for 10000 * fld Real4 mem, mem
30313 cycles for 10000 * fld Real4 mem, st
20606 cycles for 10000 * fld Real8 mem, mem
30312 cycles for 10000 * fld Real8 mem, st
42380 cycles for 10000 * fld Real10 mem, mem
41474 cycles for 10000 * fld Real10 mem, st
Hi Jochen,
Thank you again. Now my results have changed. So I am
confused once more.
F:\TEMP\TEST>FLD_MEM_.exe
Intel(R) Pentium(R) M processor 1.70GHz (SSE2)
Spinup done, 4,200,000,000*dec eax in 2695 ms
loop overhead is approx. 10082/10000 cycles
22465 cycles for 10000 * fld Real4 mem, mem
25357 cycles for 10000 * fld Real4 mem, st
27534 cycles for 10000 * fld Real8 mem, mem
28920 cycles for 10000 * fld Real8 mem, st
75225 cycles for 10000 * fld Real10 mem, mem
52567 cycles for 10000 * fld Real10 mem, st
22216 cycles for 10000 * fld Real4 mem, mem
26432 cycles for 10000 * fld Real4 mem, st
25397 cycles for 10000 * fld Real8 mem, mem
27759 cycles for 10000 * fld Real8 mem, st
70948 cycles for 10000 * fld Real10 mem, mem
51487 cycles for 10000 * fld Real10 mem, st
20858 cycles for 10000 * fld Real4 mem, mem
25416 cycles for 10000 * fld Real4 mem, st
24065 cycles for 10000 * fld Real8 mem, mem
28251 cycles for 10000 * fld Real8 mem, st
70658 cycles for 10000 * fld Real10 mem, mem
51233 cycles for 10000 * fld Real10 mem, st
--- ok ---
At least the Real10 results did not swap places.
Regards,
Steve N.
Results for the last version
Quote
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
Spinup done, 4,200,000,000*dec eax in 2041 ms
loop overhead is approx. 18160/10000 cycles
33459 cycles for 10000 * fld Real4 mem, mem
37247 cycles for 10000 * fld Real4 mem, st
49295 cycles for 10000 * fld Real8 mem, mem
45967 cycles for 10000 * fld Real8 mem, st
186538 cycles for 10000 * fld Real10 mem, mem
108538 cycles for 10000 * fld Real10 mem, st
31169 cycles for 10000 * fld Real4 mem, mem
37200 cycles for 10000 * fld Real4 mem, st
41419 cycles for 10000 * fld Real8 mem, mem
46164 cycles for 10000 * fld Real8 mem, st
184284 cycles for 10000 * fld Real10 mem, mem
108466 cycles for 10000 * fld Real10 mem, st
31042 cycles for 10000 * fld Real4 mem, mem
37448 cycles for 10000 * fld Real4 mem, st
42335 cycles for 10000 * fld Real8 mem, mem
46058 cycles for 10000 * fld Real8 mem, st
184647 cycles for 10000 * fld Real10 mem, mem
110142 cycles for 10000 * fld Real10 mem, st
--- ok ---
AMD A6-3500 APU with Radeon(tm) HD Graphics (SSE3)
Spinup done, 4,200,000,000*dec eax in 3757 ms
loop overhead is approx. 19715/10000 cycles
10047 cycles for 10000 * fld Real4 mem, mem
7442 cycles for 10000 * fld Real4 mem, st
13013 cycles for 10000 * fld Real8 mem, mem
9412 cycles for 10000 * fld Real8 mem, st
65984 cycles for 10000 * fld Real10 mem, mem
27897 cycles for 10000 * fld Real10 mem, st
7320 cycles for 10000 * fld Real4 mem, mem
11115 cycles for 10000 * fld Real4 mem, st
12497 cycles for 10000 * fld Real8 mem, mem
12401 cycles for 10000 * fld Real8 mem, st
63567 cycles for 10000 * fld Real10 mem, mem
33683 cycles for 10000 * fld Real10 mem, st
7254 cycles for 10000 * fld Real4 mem, mem
10440 cycles for 10000 * fld Real4 mem, st
12343 cycles for 10000 * fld Real8 mem, mem
7372 cycles for 10000 * fld Real8 mem, st
63265 cycles for 10000 * fld Real10 mem, mem
30161 cycles for 10000 * fld Real10 mem, st
--- ok ---
What are you guys using to run this?
The exe attached to Reply #65.
AMD A6-3500 APU with Radeon(tm) HD Graphics (SSE3)
Spinup done, 4,200,000,000*dec eax in 3757 ms
loop overhead is approx. 19715/10000 cycles
So the AMD takes a long time to get to full speed, and has a high loop overhead. Hm :(
Cool :t
16575 cycles for 10000 * fld Real4 mem, mem
18297 cycles for 10000 * fld Real4 mem, st
26425 cycles for 10000 * fld Real8 mem, mem
19634 cycles for 10000 * fld Real8 mem, st
79024 cycles for 10000 * fld Real10 mem, mem
42922 cycles for 10000 * fld Real10 mem, st
16270 cycles for 10000 * fld Real4 mem, mem
16712 cycles for 10000 * fld Real4 mem, st
24982 cycles for 10000 * fld Real8 mem, mem
17107 cycles for 10000 * fld Real8 mem, st
77920 cycles for 10000 * fld Real10 mem, mem
44588 cycles for 10000 * fld Real10 mem, st
17735 cycles for 10000 * fld Real4 mem, mem
16897 cycles for 10000 * fld Real4 mem, st
23506 cycles for 10000 * fld Real8 mem, mem
23064 cycles for 10000 * fld Real8 mem, st
79142 cycles for 10000 * fld Real10 mem, mem
41752 cycles for 10000 * fld Real10 mem, st
Pipeline?
Sorry Lone, I was asking JJ about something else.
Pipelining is a feature popularized by RISC processors: while one instruction executes, the CPU fetches the next one from memory and decodes the one after that, all in the same cycle (or something like that :biggrin:).