News:

Masm32 SDK description, downloads and other helpful links
Message to All Guests
NB: Posting URL's See here: Posted URL Change

Main Menu

Finding stalls

Started by LarryC, March 03, 2015, 02:35:53 AM

Previous topic - Next topic

LarryC

I have some code I suspect is stalling quite a bit.  I'm reading Agner Fog's Instruction tables and having a bit of trouble wrapping my head around what exactly each column means and how it is used.

Ops - Is this straight up clock cycles to execute?

Latency - Is this the amount of time between when an instruction is read and it starts executing?

Reciprocal throughput - Is this the delay before another instruction using this execution unit can start?  For example: 1/3 means I can execute 3 instructions in the same clock cycle if they use the same execution unit, but all three will take 1 cycle?

So if I look at move I see:
Ops: 1
Latency: 1
Reciprocal Throughput: 1/3

So for MOV which uses the ALU:
; 1 cycle latency, 1 cycle to execute (2 moves wasted) total 2 clock cycles (does this introduce a stall if I don't use the other two moves?)
MOV EAX, EBX 

; 1 cycle latency, 1 cycle to execute (1 moves wasted) total 2 clock cycles (does this introduce a stall if I don't use the other move?)
MOV EAX, EBX
MOV ECX, EDX

; 1 cycle latency, 1 cycled to execute (0 moves wasted) total 2 clock cycles
MOV EAX, EBX
MOV ECX, EDX
MOV ESI, EDI

Loading from memory also uses AGU (Address Generation Unit) so if I were loading from memory both of these units would be occupied but I'd use the values the same in terms of ops, latency, and reciprocal throughput?

Also if my destination is modified, does that impact stalling?
MOV EAX, EBX
ADD EAX, EDX  ; Does this create an issue since EAX was modified on last instruction?
SUB EAX, ECX  ; ....

dedndave

http://www.agner.org/optimize/instruction_tables.pdf

page 3 has a pretty good description of the terms

the cases you cite are all register-register operations
there isn't a lot of delay, here - and not a lot to be done about it

i often try to "interlace" a series of instructions to help, though

mov eax,dwSomeValue1
add eax,3
sub eax,dwSomeValue2
mov edx,dwSomeValue3
add edx,5
sub edx,dwSomeValue4


might execute a little faster as...

mov eax,dwSomeValue1
mov edx,dwSomeValue3
add eax,3
add edx,5
sub eax,dwSomeValue2
sub edx,dwSomeValue4

hutch--

Hi Larry,

Tracking down stalls has some to do with understanding what they are and why they happen. As of the 486 Intel introduced the notion of a pipeline to schedule instructions going into the processing units. In some of the later hardware, the PIV being typical, the pipeline got a lot longer so they could wind the clock rate up higher. The late PIVs (Prescott) had very long pipelines and if one instruction in one pipeline depended on the result of another instruction in the other pipeline, the pipeline had to stop until the dependency had been processed. Effectively the longer the pipeline, the worse the stall.

You avoid much of this by instruction choice, mov, sub, add, test, cmp etc .... while trying not to use many of the older more complex instructions which are often consigned to microcode rather than the fastest paths in silicon. The problem is that much of this data is hardware dependent and varies from one processor to another. The later Core2 and i3/5/7 families are smarter in their scheduling and have a lot less problems in terms of stalls than some of the earlier processors.

What you may call the "preferred" instruction set go through multiple pipelines without any problems, the term is pairing, where many of the older instructions only went through one pipeline and code dependent on its result had to wait for the pipeline to clear to continue. This resulted in severe stalls. Your friend here is the clock and if you have an instruction choice or order that does not pair through multiple pipelines, you will see the difference in timing.

Gunther

Hi Larry,

Hutch and Dave gave excellent explanitions for you. There are only two ideas to add. Agner Fog summarizees in chapter 16 of the manual Optimizing subroutines in assembly language. An optimization guide for x86 platforms the problematic instructions. It's worth reading. Furthermore, assume nothing, test your code on different machines with different processors. You'll find a lot of help in the forum.

Gunther
You have to know the facts before you can distort them.

dedndave

here's how we do it, Larry...

http://masm32.com/board/index.php?topic=3904.0

write a test piece and have several people run it in the Laboratory subforum
that way, you get to see how different algorithms run on different platforms

we often use Michael Webster's code timing macros

http://masm32.com/board/index.php?topic=49.0

hutch--

I think Larry is using the inline assembler in PB which uses the same mnemonic notation as MASM in most instances but it will not run Michael's timing macros so what I would recommend is using GetTickCount() on a reasonable sized sample that is designed to be something like the task it is going to perform then trying out different instruction combinations to see if there is any timing difference.

I tend to use this type of code in PB when timing.


LOCAL tcnt as DWORD
......
tcnt = GetTickCount

  ' run the OP for 500 ms or longer

tcnt = GetTickCount - tcnt

StdOut format$(tcnt)+" MS"

Gunther

Quote from: hutch-- on March 03, 2015, 11:21:37 AM

LOCAL tcnt as DWORD
......
tcnt = GetTickCount

  ' run the OP for 500 ms or longer

tcnt = GetTickCount - tcnt

StdOut format$(tcnt)+" MS"


for working with PB's inline assembler is that a reasonable solution.  :t The tricky thing is that it's not possible to link external assembly language procedures into the PB exe (only true for the Windows version, the DOS version has this feature). A workaround would be the usage of a DLL, but that's a bit awkward.

Gunther
You have to know the facts before you can distort them.

hutch--

Gunther,

The main win is its mnemonic compatible so you can copy a no stack frame algo from MASM and put it into a PB FASTPROC with no loss. This is a straight steal from my MASM library. Timing the algo with GetTickCount is trivial.


' ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

' szcopy proc src:DWORD,dst:DWORD  ; the MASM proc form.

FASTPROC szcopy

    #REGISTER NONE
    PREFIX "!"

    push esi
    push edi

    mov edx, [esp+12]
    mov edi, [esp+16]
    mov eax, -1
    mov esi, 1

  lbl:
    add eax, esi
    movzx ecx, BYTE PTR [edx+eax]
    mov [edi+eax], cl
    test ecx, ecx
    jnz lbl

    pop edi
    pop esi

    ret 8

    END PREFIX

END FASTPROC

' ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

Gunther

Hutch,

good thing. I'll take it into my software repository. But what is with a stack frame algorithm? I bet, you've a solution, too.

Gunther
You have to know the facts before you can distort them.

hutch--

Gunther,

You can do either. A normal high level PB function has leading and trailing stack overhead to support the high level aspects of basic which is trivial for high level language but with assembler algorithms you have the option of writing your own or not using a stack frame. Here are two templaes I keep in my PB copy of QE to save typing.

' ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

FASTPROC [stackframe]

    PREFIX "!"

    push ebp                    ; preserve base pointer
    mov ebp, esp                ; stack pointer into ebp
    sub esp, 64                 ; set the LOCAL space you need in BYTES

  ; args start at [ebp+8]



    mov esp, ebp                ; restore stack pointer
    pop ebp                     ; restore base pointer

    ret [bytecount]             ; matches arg count * 4

    END PREFIX

END FASTPROC

' ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

FASTPROC [nostackframe]

    PREFIX "!"

    push ebx                        ; push adds 4 bytes to stack address
    push esi                        ; push adds 4 bytes to stack address
    push edi                        ; push adds 4 bytes to stack address

  ; --------------------------------
  ; args start at [esp+4] plus byte
  ; count of combined PUSH mnemonics
  ; EG: [esp+4][12]
  ; --------------------------------

    pop edi                         ; restore EDI
    pop esi                         ; restore ESI
    pop ebx                         ; restore EBX

    ret [bytecount]                 ; balance stack, arg count * 4

    END PREFIX

END FASTPROC

' ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤