Weird speed differences for seemingly identical code

Started by jj2007, August 27, 2015, 11:44:41 PM

jj2007

Hi,

This archive contains two versions of roughly the same code:
- fbsl.exe BenchJJ.fbs (i.e. drag the fbs over the exe)
- BenchJJ.exe

They both have a Switch/Case loop whose disassembly is in the two *.txt files.
Now the odd thing is that one takes 1,000 ms and the other only about 700 ms on my machine, despite their identical look and identical (mis-)alignment.

Before jumping to theories and conclusions, I'd like to see a few timings. The code is not mine, but I am very confident that it's free of malware (more).

Relevant excerpt:
    Invoke GetTickCount
    mov ebx, eax // get initial ticks
    .Repeat
      inc swVar
      .If swVar > 100
        xor swVar, swVar
      .EndIf
     
      .If swVar = 0
        inc ct0
        jmp @F
      .EndIf
      .If swVar = 1
        inc ct1
        jmp @F
      .EndIf
      .If swVar = 2
        inc ct2
        jmp @F
      .EndIf
      .If swVar = 4
        inc ct4
        jmp @F
      .EndIf
      inc ctDef
      @@
     
      inc loopCt
      ; int 3
    .Until loopCt > 200000000
   
    Invoke GetTickCount // get current ticks
    sub eax, ebx // calc tick delta
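
For readers who want to reproduce the measurement outside FBSL, the loop above can be approximated in C. This is only a sketch: the names `SwitchCounts` and `run_switch_loop` are made up here, `swVar` is assumed to start at zero because the excerpt does not show its setup, and a C compiler is free to emit machine code quite different from the disassembly in the archive, so absolute timings will not match.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical container for the counters named in the excerpt. */
typedef struct {
    uint32_t ct0, ct1, ct2, ct4, ctDef;
} SwitchCounts;

/* Runs the Switch/Case loop `iterations` times and fills `out`.
   Assumption: swVar starts at 0 (its initialisation is not in the excerpt). */
void run_switch_loop(uint32_t iterations, SwitchCounts *out)
{
    uint32_t swVar = 0;
    memset(out, 0, sizeof *out);

    for (uint32_t loopCt = 0; loopCt < iterations; ++loopCt) {
        ++swVar;
        if (swVar > 100)
            swVar = 0;              /* wrap around, as "xor swVar, swVar" does */

        switch (swVar) {
        case 0:  ++out->ct0;   break;
        case 1:  ++out->ct1;   break;
        case 2:  ++out->ct2;   break;
        case 4:  ++out->ct4;   break;
        default: ++out->ctDef; break;
        }
    }
}
```

If `loopCt` starts at zero in the original, the `.Until loopCt > 200000000` condition corresponds to calling `run_switch_loop(200000001u, &counts)`; wrapping that call in a pair of `clock()` readings from `<time.h>` mirrors the two GetTickCount calls.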

satpro

Off-topic, but I would like to thank you for providing the link to the BASIC language programming dev forum, basicprogramming.org. It is exactly what I have been looking for for my own project, and I almost can't believe I hadn't at least stumbled over it before now. :t

dedndave

i see a couple hundred ms difference

fbsl with no firefox running ~950 ms
fbsl with firefox running ~1200 ms

benchjj with no firefox running ~780 ms
benchjj with firefox running ~950 ms

seems a little screwy - lol
the fbsl program does not allow copy/paste of text
maybe you could write a more convenient test bed
probably why you're not seeing more replies
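
The gap between the "firefox running" and "no firefox" numbers looks like background load rather than anything in the loop itself. One common way to reduce that kind of noise is to time several runs and keep the fastest. The C sketch below illustrates the idea; `min_sample` and `best_of_ms` are hypothetical helper names, and `clock()` is used only because it is portable (what it measures differs between platforms, so treat the results as relative).

```c
#include <stddef.h>
#include <time.h>

/* Returns the smallest of n timing samples; 0.0 if n is 0. */
double min_sample(const double *samples, size_t n)
{
    if (n == 0)
        return 0.0;
    double best = samples[0];
    for (size_t i = 1; i < n; ++i)
        if (samples[i] < best)
            best = samples[i];
    return best;
}

/* Calls fn() `runs` times and returns the fastest run in milliseconds. */
double best_of_ms(void (*fn)(void), size_t runs)
{
    double best = 0.0;
    for (size_t i = 0; i < runs; ++i) {
        clock_t t0 = clock();
        fn();
        clock_t t1 = clock();
        double ms = 1000.0 * (double)(t1 - t0) / (double)CLOCKS_PER_SEC;
        if (i == 0 || ms < best)
            best = ms;
    }
    return best;
}
```

Applied to the three lower figures quoted above (950, 780 and 950 ms), `min_sample` would report 780 ms; the minimum is usually the run least disturbed by other processes.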

rrr314159

On my slow AMD, they take 2200 vs. 1700 ms.

I often see speed differences that are hard to pin down, which is very annoying when one is coding for speed in particular. For instance, after the timing loops I print out various arrays. It turned out that changing the printouts (because I'm trying to pin down speed diffs) changes the speeds! It's partly cured by throwing in lots of align statements (especially in inner loops, naturally). Obviously one should use a real-time OS, or maybe plain DOS, for critical timings, but that's too much trouble (I don't have a lot of time for programming these weeks).
I am NaN ;)

jj2007

Thanks, folks.

It looks as if Mike had the right idea: instruction cache boundary problems.
In fact, if I insert nineteen invoke GetTickCount calls plus two nops at line 28 of BenchJJ.fbs (after mov edi, 5), the boundaries are shifted down far enough to produce a dramatic drop from 1,000 to 604 milliseconds. Umph 8)
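
To see how padding moves code relative to a boundary, it helps to compute where an address sits inside its aligned block. The C sketch below does only that arithmetic. The helper names are hypothetical, and the block size (16, 32 or 64 bytes are typical candidates on x86) is an assumption: the thread does not establish which boundary matters on this particular CPU.

```c
#include <stdint.h>

/* Offset of addr inside its aligned block; block must be a power of two. */
uint32_t offset_in_block(uintptr_t addr, uint32_t block)
{
    return (uint32_t)(addr & (uintptr_t)(block - 1));
}

/* Padding bytes needed so that addr lands on the next block boundary
   (0 if it is already aligned). */
uint32_t padding_to_boundary(uintptr_t addr, uint32_t block)
{
    return (block - offset_in_block(addr, block)) & (block - 1);
}

/* Nonzero if the code span [addr, addr + len) straddles a block boundary. */
int straddles_boundary(uintptr_t addr, uint32_t len, uint32_t block)
{
    if (len == 0)
        return 0;
    return (addr / block) != ((addr + len - 1) / block);
}
```

For example, a loop head at the made-up address 0x401023 sits 3 bytes into a 16-byte block and would need 13 bytes of padding to start on the next boundary. Feeding the addresses from the two disassembly listings into helpers like these is one way to check whether the slow and fast variants really differ in how their branch targets fall.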

rrr314159

I bet if you now modify the code - even after the affected area - you'll have to re-align (usually). And, on another machine - even a similar one - it will be different. Is there a systematic way to handle this? For instance, a system call to tell what type of instruction cache one has, and where the boundaries are?

Another q.: do the best C++ compilers (Visual C++, I suppose) handle this problem? I don't hear such complaints from their users.

jj2007

Good question. So far we never ran into this problem, maybe because our proggies are small - in contrast, the fbsl.exe is half a megabyte, so it is easy to fill a 32k or 64k instruction cache. Still, the exact mechanism is not very clear to me. But obviously, one nop more can make a really big difference.