Use of EMMS Instruction

Started by dedndave, July 28, 2014, 10:22:06 AM


dedndave

looking at Michael Webster's timing macros.....

don't know if nidud and Rui have been using any MMX code (don't think so - haven't been watching that closely)
trying to understand 2 things about the EMMS instruction....

1) if we are going to perform FINIT, anyways, what purpose would EMMS serve ?
2) if we are not using any MMX instructions, what is the harm in using EMMS before FPU ?

https://software.intel.com/sites/products/documentation/studio/composer/en-us/2011Update/compiler_c/intref_cls/common/intref_mmx_emms_why.htm

MichaelW

1) No purpose that I can see, but if all you actually need from the FINIT is to set the tag word to empty, EMMS should provide a faster way to do that.

2) Assuming that setting the tag word to empty is no problem, then no harm other than lost clock cycles.

Per Intel's description of EMMS:
Quote
Sets the values of all the tags in the x87 FPU tag word to empty (all 1s). This operation marks the x87 FPU data registers (which are aliased to the MMX technology registers) as available for use by x87 FPU floating-point instructions. All other MMX instructions (other than the EMMS instruction) set all the tags in x87 FPU tag word to valid (all 0s).
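The tag-word behaviour in that quote can be observed directly. The following is a minimal sketch, not tested here, that stores the FPU environment with FNSTENV after an MMX instruction and again after EMMS. The buffer name fpuenv is made up for the example, and it assumes the 28-byte 32-bit protected-mode environment layout, where the tag word sits at offset 8. Going by the quote, the second value should read FFFFh. The first value is whatever the processor reports for registers that are in use, so no particular number is claimed for it.

```asm
;===============================================================================
include \masm32\include\masm32rt.inc
.686
.mmx
;===============================================================================
.data?
    fpuenv  db 28 dup (?)           ; hypothetical buffer: 28-byte FPU environment image
.code
start:

    pxor    mm0, mm0                ; any MMX instruction other than EMMS marks the registers in use
    fnstenv fpuenv                  ; store the environment (assumed layout: tag word at offset 8)
    movzx   ebx, WORD PTR fpuenv[8]

    emms                            ; mark all eight registers empty again
    fnstenv fpuenv
    movzx   esi, WORD PTR fpuenv[8]

    ; print only after EMMS, so the CRT is never called with MMX state active
    printf("tag word after pxor: %04Xh\n", ebx)
    printf("tag word after emms: %04Xh\n\n", esi)

    inkey
    exit

end start
```

Note that FNSTENV also masks all FPU exceptions after storing, which is harmless in a throwaway test like this one.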


;===============================================================================
include \masm32\include\masm32rt.inc
.686
.mmx
include \masm32\macros\timers.asm
;===============================================================================
.data
.code
start:

    invoke Sleep, 6000
 
    counter_begin 1000000, REALTIME_PRIORITY_CLASS
        finit
    counter_end
    printf("%d cycles, finit\n", eax)
   
    counter_begin 1000000, REALTIME_PRIORITY_CLASS
        emms
    counter_end
    printf("%d cycles, emms\n\n", eax)

    inkey
    exit
   
end start


Running the above on a Core i3 under Windows 7 (the only system I have set up ATM) I get a consistent:

80 cycles, finit
11 cycles, emms







Well Microsoft, here's another nice mess you've gotten us into.

dedndave

thanks, Michael

i may not need it, anyways
but, i wanted to better understand the proper use of the instruction

yah - it seems that FINIT takes care of the tag word   :P

jj2007

It's difficult to find an intelligent usage for MMX instructions. They became obsolete in 1999, when SSE was introduced. Therefore, any program that uses the FPU should simply start with finit, and that's it... the 80 cycles would play a role if it was used in a loop with a Million iterations, but no sane programmer would do that.

Gunther

Quote from: jj2007 on July 28, 2014, 06:22:39 PM
It's difficult to find an intelligent usage for MMX instructions. They became obsolete in 1999, when SSE was introduced. Therefore, any program that uses the FPU should simply start with finit, and that's it... the 80 cycles would play a role if it was used in a loop with a Million iterations, but no sane programmer would do that.

Not necessarily. One can use MMX for temporary storage of registers and RAM copy for older machines.
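As an illustration of the "temporary storage of registers" idea, here is a minimal sketch that parks EBX in MM0 and fetches it back, with EMMS afterwards so that the FPU registers are marked free again. The value and the choice of registers are arbitrary, and the snippet has not been timed against PUSH/POP.

```asm
;===============================================================================
include \masm32\include\masm32rt.inc
.686
.mmx
;===============================================================================
.code
start:

    mov   ebx, 12345678h            ; arbitrary test value
    movd  mm0, ebx                  ; park EBX in the low dword of MM0
    xor   ebx, ebx                  ; EBX is now free for other work
    movd  ebx, mm0                  ; bring the saved value back
    emms                            ; release the aliased FPU registers before any FPU use

    printf("ebx restored: %08Xh\n\n", ebx)

    inkey
    exit

end start
```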

Gunther
You have to know the facts before you can distort them.

jj2007

Quote from: Gunther on July 28, 2014, 10:26:14 PM
One can use MMX for temporary storage of registers and RAM copy for older machines.

Gunther,
If you are patient enough to "work" with a fifteen-year-old machine, then rep movsd will be fast enough, too.

MichaelW

How about if you are poor enough to work on a 15 year-old machine?

My P3 is not available ATM,  but running this code:

;===============================================================================
include \masm32\include\masm32rt.inc
.686
.mmx
include \masm32\macros\timers.asm
;===============================================================================
.data
    buff1  dd 1024 dup (0)   
    buff2  dd 1000 dup (0)
.code
start:

    invoke Sleep, 6000
 
    counter_begin 1000000, REALTIME_PRIORITY_CLASS
        mov   esi, OFFSET buff1
        mov   edi, OFFSET buff2
        mov   ecx, 1000
        rep   movsd       
    counter_end
    printf("%d cycles, rep movsd * 1000\n", eax)
   
    counter_begin 1000000, REALTIME_PRIORITY_CLASS       
        mov   esi, OFFSET buff1
        mov   edi, OFFSET buff2
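        mov   ecx, 499              ; index of the last of 500 qwords
      @@: 
        movq  mm0, [esi+ecx*8]      ; copy 8 bytes per iteration through MM0
        movq  [edi+ecx*8], mm0
        dec   ecx                   ; walk from the highest address down to index 0
        jns   @B                    ; loop until ecx goes negative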
    counter_end
    printf("%d cycles, movq * 500 \n\n", eax)

    inkey
    exit
   
end start


On the Core i3 I get a fairly consistent:

170 cycles, rep movsd * 1000
559 cycles, movq * 500


I didn't have time to redo the movq code so it processes from lower addresses to higher addresses, but I doubt that this change would make up the difference.
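For anyone who wants to try it, a forward-running version of the movq loop might look like the sketch below. It has not been run or timed here, so no cycle count is claimed. The EMMS after counter_end is an addition that is not in the code above; it sits outside the timed region so that it does not affect the measurement.

```asm
;===============================================================================
include \masm32\include\masm32rt.inc
.686
.mmx
include \masm32\macros\timers.asm
;===============================================================================
.data
    buff1  dd 1024 dup (0)
    buff2  dd 1000 dup (0)
.code
start:

    invoke Sleep, 6000

    counter_begin 1000000, REALTIME_PRIORITY_CLASS
        mov   esi, OFFSET buff1
        mov   edi, OFFSET buff2
        xor   ecx, ecx              ; start at qword index 0
      @@:
        movq  mm0, [esi+ecx*8]      ; copy 8 bytes per iteration through MM0
        movq  [edi+ecx*8], mm0
        inc   ecx                   ; walk from low addresses to high
        cmp   ecx, 500
        jb    @B                    ; 500 qwords = 4000 bytes, same as the rep movsd test
    counter_end
    emms                            ; added here: mark the FPU registers free (EAX is untouched)
    printf("%d cycles, movq * 500 forward\n\n", eax)

    inkey
    exit

end start
```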



FORTRANS

Hi,

   P-III data.

19 cycles, finit
2 cycles, emms

Press any key to continue ...


903 cycles, rep movsd * 1000
1520 cycles, movq * 500

Press any key to continue ...


Regards,

Steve