Use of EMMS Instruction

dedndave · July 28, 2014, 10:22:06 AM

looking at Michael Webster's timing macros.....

don't know if nidud and Rui have been using any MMX code (don't think so - haven't been watching that closely)
trying to understand 2 things about the EMMS instruction....

1) if we are going to perform FINIT, anyways, what purpose would EMMS serve ?
2) if we are not using any MMX instructions, what is the harm in using EMMS before FPU ?

https://software.intel.com/sites/products/documentation/studio/composer/en-us/2011Update/compiler_c/intref_cls/common/intref_mmx_emms_why.htm

MichaelW · July 28, 2014, 01:08:48 PM

1) No purpose that I can see, but if all you actually need from the FINIT is to set the tag word to empty, EMMS should provide a faster way to do that.

2) Assuming that setting the tag word to empty is no problem, then no harm other than lost clock cycles.

Per Intel's description of EMMS:

Quote
Sets the values of all the tags in the x87 FPU tag word to empty (all 1s). This operation marks the x87 FPU data registers (which are aliased to the MMX technology registers) as available for use by x87 FPU floating-point instructions. All other MMX instructions (other than the EMMS instruction) set all the tags in x87 FPU tag word to valid (all 0s).

Code:


;===============================================================================
include \masm32\include\masm32rt.inc
.686
.mmx
include \masm32\macros\timers.asm
;===============================================================================
.data
.code
start:

    invoke Sleep, 6000
  
    counter_begin 1000000, REALTIME_PRIORITY_CLASS
        finit
    counter_end
    printf("%d cycles, finit\n", eax)
    
    counter_begin 1000000, REALTIME_PRIORITY_CLASS
        emms
    counter_end
    printf("%d cycles, emms\n\n", eax)

    inkey
    exit
    
end start

Running the above on a Core i3 under Windows 7 (the only system I have set up ATM) I get a consistent:

Code:


80 cycles, finit
11 cycles, emms
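
The tag-word behaviour in the Intel description quoted above can be observed directly. This is only a sketch, not part of the timing test: it assumes the same MASM32 setup as the code above, and `fpuenv` is a name made up for the example. It stores the FPU environment with FNSTENV and prints the tag word (offset 8 in the 28-byte 32-bit protected-mode image) after an MMX instruction and again after EMMS.

```asm
;===============================================================================
include \masm32\include\masm32rt.inc
.686
.mmx
;===============================================================================
.data?
    fpuenv  db 28 dup (?)           ; hypothetical buffer for the FNSTENV image
.code
start:

    finit
    pxor    mm0, mm0                ; any MMX instruction except EMMS marks the tags in use
    fnstenv fpuenv                  ; store the FPU environment, tag word at offset 8
    movzx   eax, word ptr fpuenv[8]
    printf("tag word after pxor: %04Xh\n", eax)

    emms                            ; mark all eight registers empty again
    fnstenv fpuenv
    movzx   eax, word ptr fpuenv[8]
    printf("tag word after emms: %04Xh\n\n", eax)

    inkey
    exit

end start
```

After EMMS the tag word should read FFFFh (all empty). The value after the PXOR should contain no empty (11b) tags, but its exact bits depend on what the registers hold, because FNSTENV derives the 2-bit tags it stores from the register contents.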

dedndave · July 28, 2014, 01:37:53 PM

thanks, Michael

i may not need it, anyways
but, i wanted to better understand the proper use of the instruction :t

yah - it seems that FINIT takes care of the tag word :P

jj2007 · July 28, 2014, 06:22:39 PM

It's difficult to find an intelligent usage for MMX instructions. They became obsolete in 1999, when SSE was introduced. Therefore, any program that uses the FPU should simply start with finit, and that's it... the 80 cycles would play a role if it was used in a loop with a million iterations, but no sane programmer would do that.

Gunther · July 28, 2014, 10:26:14 PM

Quote from: jj2007 on July 28, 2014, 06:22:39 PM
It's difficult to find an intelligent usage for MMX instructions. They became obsolete in 1999, when SSE was introduced. Therefore, any program that uses the FPU should simply start with finit, and that's it... the 80 cycles would play a role if it was used in a loop with a million iterations, but no sane programmer would do that.

Not necessarily. One can use MMX for temporary storage of registers and RAM copy for older machines.

Gunther
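
A minimal sketch of the temporary-storage idea, assuming the usual MASM32 setup: EBX is parked in MM0 instead of being pushed on the stack, then restored. The test value and the choice of MM0 are only for illustration. The EMMS at the end matters if any x87 code runs afterwards.

```asm
;===============================================================================
include \masm32\include\masm32rt.inc
.686
.mmx
;===============================================================================
.code
start:

    mov     ebx, 12345678h          ; illustrative value to preserve
    movd    mm0, ebx                ; park EBX in an MMX register
    xor     ebx, ebx                ; EBX is now free as a scratch register
    movd    ebx, mm0                ; restore the saved value
    emms                            ; leave MMX state before any x87 code

    printf("ebx restored to %08Xh\n", ebx)

    inkey
    exit

end start
```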

jj2007 · July 28, 2014, 10:33:43 PM

Quote from: Gunther on July 28, 2014, 10:26:14 PM
One can use MMX for temporary storage of registers and RAM copy for older machines.

Gunther,
If you are patient enough to "work" with a fifteen-year-old machine, then rep movsd will be fast enough, too.

MichaelW · July 29, 2014, 01:30:58 AM

How about if you are poor enough to work on a 15-year-old machine?

My P3 is not available ATM, but running this code:

Code:


;===============================================================================
include \masm32\include\masm32rt.inc
.686
.mmx
include \masm32\macros\timers.asm
;===============================================================================
.data
    buff1  dd 1024 dup (0)    
    buff2  dd 1000 dup (0)
.code
start:

    invoke Sleep, 6000
  
    counter_begin 1000000, REALTIME_PRIORITY_CLASS
        mov   esi, OFFSET buff1
        mov   edi, OFFSET buff2
        mov   ecx, 1000
        rep   movsd        
    counter_end
    printf("%d cycles, rep movsd * 1000\n", eax)
    
    counter_begin 1000000, REALTIME_PRIORITY_CLASS        
        mov   esi, OFFSET buff1
        mov   edi, OFFSET buff2
        mov   ecx, 499
      @@:  
        movq  mm0, [esi+ecx*8]
        movq  [edi+ecx*8], mm0
        dec   ecx
        jns   @B        
    counter_end
    printf("%d cycles, movq * 500 \n\n", eax)

    inkey
    exit
    
end start

On the Core i3 I get a fairly consistent:

Code:


170 cycles, rep movsd * 1000
559 cycles, movq * 500

I didn't have time to redo the movq code so it processes from lower addresses to higher addresses, but I doubt that this change would make up the difference.
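
For reference, a sketch of what the low-to-high version of the movq loop might look like, using the same buffers and timing macros as above. It has not been timed here, so no cycle counts are claimed. The pointers are set to the end of the 4000-byte region and a negative index counts up to zero, which keeps the loop at one INC and one JNZ per qword. The EMMS after `counter_end` is an addition that is not in the original test.

```asm
;===============================================================================
include \masm32\include\masm32rt.inc
.686
.mmx
include \masm32\macros\timers.asm
;===============================================================================
.data
    buff1  dd 1024 dup (0)
    buff2  dd 1000 dup (0)
.code
start:

    invoke Sleep, 6000

    counter_begin 1000000, REALTIME_PRIORITY_CLASS
        mov   esi, OFFSET buff1 + 4000  ; one past the last source qword
        mov   edi, OFFSET buff2 + 4000  ; one past the last destination qword
        mov   ecx, -500                 ; index runs from -500 up to -1
      @@:
        movq  mm0, [esi+ecx*8]          ; first pass reads buff1+0
        movq  [edi+ecx*8], mm0
        inc   ecx
        jnz   @B                        ; stop when the index reaches zero
    counter_end
    emms                                ; leave MMX state before any x87 code
    printf("%d cycles, movq * 500 ascending\n\n", eax)

    inkey
    exit

end start
```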

FORTRANS · July 29, 2014, 02:39:14 AM

Hi,

P-III data.

Code:

19 cycles, finit
2 cycles, emms

Press any key to continue ...

Code:

903 cycles, rep movsd * 1000
1520 cycles, movq * 500

Press any key to continue ...

Regards,

Steve
