looking at Michael Webster's timing macros.....
don't know if nidud and Rui have been using any MMX code (don't think so - haven't been watching that closely)
trying to understand 2 things about the EMMS instruction....
1) if we are going to perform FINIT, anyways, what purpose would EMMS serve ?
2) if we are not using any MMX instructions, what is the harm in using EMMS before FPU ?
https://software.intel.com/sites/products/documentation/studio/composer/en-us/2011Update/compiler_c/intref_cls/common/intref_mmx_emms_why.htm
1) No purpose that I can see, but if all you actually need from the FINIT is to set the tag word to empty, EMMS should provide a faster way to do that.
2) Assuming that setting the tag word to empty is no problem, then no harm other than lost clock cycles.
Per Intel, on EMMS:
Quote
Sets the values of all the tags in the x87 FPU tag word to empty (all 1s). This operation marks the x87 FPU data registers (which are aliased to the MMX technology registers) as available for use by x87 FPU floating-point instructions. All other MMX instructions (other than the EMMS instruction) set all the tags in x87 FPU tag word to valid (all 0s).
;===============================================================================
include \masm32\include\masm32rt.inc
.686
.mmx
include \masm32\macros\timers.asm
;===============================================================================
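.data
.code
start:
    invoke Sleep, 6000                  ; pause 6 s before timing starts
    ; time FINIT over 1,000,000 loops at realtime priority
    counter_begin 1000000, REALTIME_PRIORITY_CLASS
        finit
    counter_end                         ; cycle count comes back in eax
    printf("%d cycles, finit\n", eax)
    ; time EMMS the same way
    counter_begin 1000000, REALTIME_PRIORITY_CLASS
        emms
    counter_end
    printf("%d cycles, emms\n\n", eax)
    inkey
    exit
end start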
Running the above on a Core i3 under Windows 7 (the only system I have set up ATM) I get a consistent:
80 cycles, finit
11 cycles, emms
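To make the Intel text quoted above concrete, here is a minimal sketch of where EMMS normally goes: after the last MMX instruction and before the first x87 instruction. The data names (val1, val2, sum, two, root) are made up for illustration. Any MMX instruction marks all eight tags valid, so without the EMMS the FLD would see a full register stack and, with exceptions masked, load a NaN instead of 2.0.

```asm
include \masm32\include\masm32rt.inc
.686
.mmx

.data
val1    dd 05060708h, 01020304h     ; 8 packed bytes (hypothetical test data)
val2    dd 01010101h, 01010101h
sum     dd 0, 0
two     REAL8 2.0
root    REAL8 0.0

.code
start:
    movq    mm0, QWORD PTR val1     ; first MMX instruction: all tags become valid
    paddb   mm0, QWORD PTR val2     ; packed byte add
    movq    QWORD PTR sum, mm0
    emms                            ; all tags back to empty before any x87 code
    fld     two                     ; push succeeds because the stack is marked empty
    fsqrt
    fstp    root
    inkey
    exit
end start
```

A FINIT in place of the EMMS would also clear the tags, but it resets the control word as well, which is more than this case needs.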
thanks, Michael
i may not need it, anyways
but, i wanted to better understand the proper use of the instruction :t
yah - it seems that FINIT takes care of the tag word :P
It's difficult to find an intelligent usage for MMX instructions. They became obsolete in 1999, when SSE was introduced. Therefore, any program that uses the FPU should simply start with finit, and that's it... the 80 cycles would play a role if it was used in a loop with a million iterations, but no sane programmer would do that.
Quote from: jj2007 on July 28, 2014, 06:22:39 PM
It's difficult to find an intelligent usage for MMX instructions. They became obsolete in 1999, when SSE was introduced. Therefore, any program that uses the FPU should simply start with finit, and that's it... the 80 cycles would play a role if it was used in a loop with a million iterations, but no sane programmer would do that.
Not necessarily. One can use MMX for temporary storage of registers and RAM copy for older machines.
Gunther
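A minimal sketch of the "temporary storage" idea, assuming nothing else in the program is using the x87 stack at the same time: a general-purpose register is parked in an MMX register with MOVD instead of being pushed, then fetched back later. The value and the choice of ebx/mm0 are only for illustration.

```asm
include \masm32\include\masm32rt.inc
.686
.mmx

.code
start:
    mov     ebx, 12345678h          ; hypothetical value worth keeping
    movd    mm0, ebx                ; park ebx in mm0 instead of pushing it
    xor     ebx, ebx                ; ebx is now free for other work
    movd    ebx, mm0                ; fetch the parked value back
    emms                            ; leave the FPU tags empty for later x87 code
    printf("%08Xh\n", ebx)          ; should show the parked value again
    inkey
    exit
end start
```

The EMMS at the end is the price of this trick: any code that touches the FPU afterwards needs the tags cleared first.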
Quote from: Gunther on July 28, 2014, 10:26:14 PM
One can use MMX for temporary storage of registers and RAM copy for older machines.
Gunther,
If you are patient enough to "work" with a fifteen-year-old machine, then rep movsd will be fast enough, too.
How about if you are poor enough to work on a 15-year-old machine?
My P3 is not available ATM, but running this code:
;===============================================================================
include \masm32\include\masm32rt.inc
.686
.mmx
include \masm32\macros\timers.asm
;===============================================================================
.data
buff1 dd 1024 dup (0)
buff2 dd 1000 dup (0)
.code
start:
    invoke Sleep, 6000                  ; pause 6 s before timing starts
    ; copy 1000 dwords (4000 bytes) with rep movsd
    counter_begin 1000000, REALTIME_PRIORITY_CLASS
        mov esi, OFFSET buff1
        mov edi, OFFSET buff2
        mov ecx, 1000
        rep movsd
    counter_end
    printf("%d cycles, rep movsd * 1000\n", eax)
    ; copy the same 4000 bytes as 500 qwords through mm0, highest address first
    counter_begin 1000000, REALTIME_PRIORITY_CLASS
        mov esi, OFFSET buff1
        mov edi, OFFSET buff2
        mov ecx, 499                    ; index of the last qword
      @@:
        movq mm0, [esi+ecx*8]
        movq [edi+ecx*8], mm0
        dec ecx
        jns @B                          ; loop until ecx goes negative
    counter_end
    printf("%d cycles, movq * 500\n\n", eax)
    inkey
    exit
end start
On the Core i3 I get a fairly consistent:
170 cycles, rep movsd * 1000
559 cycles, movq * 500
I didn't have time to redo the movq code so it processes from lower addresses to higher addresses, but I doubt that this change would make up the difference.
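For anyone who wants to try it, a sketch of the low-to-high variant mentioned above, using the same buffers and the same timing macros. It is untimed here, so no cycle counts are claimed; the only changes are the ascending index and the compare against 500.

```asm
include \masm32\include\masm32rt.inc
.686
.mmx
include \masm32\macros\timers.asm

.data
buff1 dd 1024 dup (0)
buff2 dd 1000 dup (0)

.code
start:
    invoke Sleep, 6000
    counter_begin 1000000, REALTIME_PRIORITY_CLASS
        mov esi, OFFSET buff1
        mov edi, OFFSET buff2
        xor ecx, ecx                    ; start at qword 0
      @@:
        movq mm0, [esi+ecx*8]           ; load 8 bytes from the source
        movq [edi+ecx*8], mm0           ; store them to the destination
        inc ecx
        cmp ecx, 500                    ; 500 qwords = 1000 dwords
        jb  @B
    counter_end
    emms                                ; outside the timed region
    printf("%d cycles, movq * 500, ascending\n\n", eax)
    inkey
    exit
end start
```

The extra CMP adds one instruction per iteration compared with the DEC/JNS version, so any gain from the ascending access pattern has to outweigh that.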
Hi,
P-III data.
19 cycles, finit
2 cycles, emms
Press any key to continue ...
903 cycles, rep movsd * 1000
1520 cycles, movq * 500
Press any key to continue ...
Regards,
Steve