News:

Masm32 SDK description, downloads and other helpful links
Message to All Guests

Main Menu

Speed testing algos

Started by LordAdef, March 25, 2017, 04:24:49 PM

Previous topic - Next topic

LordAdef

Hi there, I usually post in my own thread in the Campus, where I´m making this game. But since this is related to timing algos, I thought this should be the right place.

The thing is, I realized I don´t know how to properly time my code to accurately tell if I´m optimizing it right.
There is always some system interference or some fluctuation. I´m sure there must be a way to accurately measure your code.

In the case below, there are 2 algos which I put in the same proggy. The "old" copies values byte per byte, while the 2nd invokes the memcopy to copy dword chunks first, and the remaing goes byte per byte.

I didn´t optmized it so far, and wanted to call for help before I begin. All I did is change the copy behaviour as described above.  :dazzled:

The old one uses this macro called mUNROLL:
QuotemUNROLL MACRO
         PUSH ecx
         mov ecx, tChar
         xor eax, eax
          @@:
            mov [edi], cl
              cmp eax, ebx
         jz @F          
             add edi, 1
         add eax, 1
         jmp @B
         @@:
         POP ecx
      endm

The "new" uses the mUNROLLdword:
QuotemUNROLLdword MACRO
      PUSH edx
      mov eax, ebx                        ; MOD here
      mov ecx, 4
      cdq
      div ecx                              ; result in eax, sobra in edx; MOD

      sub ebx, edx                        ; only dwords here, we do edx afterwards   
      .IF tChar == 64
         mov eax, offset Ln1
      .ELSEIF tChar == 32
         mov eax,  offset Ln2
      .ENDIF
      invoke MemCopy, eax, edi, ebx      ; copying dword sizes
      add edi, ebx                        ; add a line lenght for edi index
      mov bl, [eax]                        ; this is either @ or space
      xor ecx, ecx
   @@:
      cmp ecx, edx
      jz @F
      mov [edi], bl
      add edi, 1
      add ecx, 1          
      jmp @B
   @@:         
      POP edx
      endm

I´m calling both algos from here:
Quote.code
start:
   ;old algo. Copies byte by byte
   printf ("\nNew Algo: ")
   ;NanoTimer()
   invoke pDecompress, offset NewFile
   ;Print Str$("%i us ", NanoTimer(us))
      
   ;New algo. Copies dwords, then the remaining byte by byte. Uses "memcopy"
   printf ("\nOld Algo: ")
   ;NanoTimer()
   invoke pDecompress2, offset NewFile
   ;Print Str$("%i us ", NanoTimer(us))

I tested multiplying the calls by 4 each, 8 and 16. I also inverted the invoking order, which appeared to change things too.
Quote1 call
old   174   197   192   154   268   227   211   212   157   169   162   219   =2
new   192   161!   157!   153x   187!   173!   158!   200!   165   141!   158!   157!   =9

1 call (New 1st)
new   178   212   186   183   170x   153   196   168   169   193   192   =4
old   140!   250   162!   172!   171x   192   253   147!   162!   135!   202   =6

4 calls:
old   515   616   512   588   740   480   518   503   543   532   =4
new   499!   556!   535   558!   735!   678   481!   551   511!   557   =6

8 calls:
old   995   923   937   1096   1083   914   999   886   934   905   973   =3
new   904!   900!   1100   906!   961!   931   902!   902!   1107   894!   917!   =8

16 calls:
old   1799   1783   2352   1907   3213   1845x   1688   1727x   1844   1732   1717   =4
new   1702!   1730!   1763!   1896!   2843!   1849x   1800   1722x   2030   1826   1756   =5

16 calls (New first):   
new   1913   1784   1969   2162   1837   1900   1728   2630   2165   1729   1763   =0
old   1732!   1658!   1778!   1902!   1733!   1688!   1615!   2029!   1638!   1669!   1649!   =11

If you look at my last test for 16 calls, my New algo looses in every take.

jj2007

Use a 1000 calls, as shown below:

      printf ("\nNew Algo: ")
      NanoTimer()
      push 999
      .Repeat
            invoke pDecompress, offset NewFile
            dec stack
      .Until Sign?
      pop edx
      Print Str$("%i ms ", NanoTimer(ms))
           
      ;New algo. Copies dwords, then the remaining byte by byte. Uses "memcopy"
      printf ("\tOld Algo: ")
      NanoTimer()
      push 999
      .Repeat
            invoke pDecompress2, offset NewFile
            dec stack
      .Until Sign?
      pop edx
      Print Str$("%i ms ", NanoTimer(ms))
      endm


New Algo: 129 ms        Old Algo: 132 ms
New Algo: 123 ms        Old Algo: 128 ms
New Algo: 126 ms        Old Algo: 123 ms
New Algo: 127 ms        Old Algo: 118 ms
New Algo: 117 ms        Old Algo: 128 ms
New Algo: 129 ms        Old Algo: 138 ms
New Algo: 123 ms        Old Algo: 124 ms
New Algo: 123 ms        Old Algo: 122 ms
New Algo: 123 ms        Old Algo: 124 ms
New Algo: 120 ms        Old Algo: 122 ms


Results in the 100 ms range are much more reliable than something in the us range.
The timer example in the templates is optimised for such problems btw.

LordAdef

Nice JJ!

dec "stack" is new to me by the way. MB?

jj2007

Quote from: LordAdef on March 25, 2017, 07:38:00 PM
dec "stack" is new to me by the way. MB?

Yes, it's just an equate: stack equ dword ptr [esp]
      push 999
      .Repeat
            invoke pDecompress, offset NewFile
            dec stack   ; same as dec dword ptr [esp]
      .Until Sign?
      pop edx


It will cost an extra cycle in that loop, but that hardly matters, and it's a very simple way to get a counter without using a register. And no risk to end up by accident with an endless loop - I avoid .Until Zero? wherever possible ;)

mineiro

hello sir LordAdef;
I took some time to understand your code, sound to me that you're thinking so hard, you can do this procedure on a more easy way.

To better readable masm have 'struct' keyword, acts like opaque structures.
header_file struct
output_size dd ?
line_count dd ?
line_lenght dd ?
header_file ends

so, on your code you can use do something like:

mov   esi, pstring ;pointer to file
assume esi:ptr header_file
m2m tSum,[esi].output_size
m2m tLine,[esi].line_count
m2m tLineLen,[esi].line_lenght
assume esi:nothing
; add esi,sizeof header_file ;displacement header file
; sub szstring,sizeof header_file ;szstring==szstring-header_file

On your code you have used "mov ecx, DWORD Ptr [esi+8]" to get line_lenght . On line "mov edx, 12 ;counter" you can write "mov edx,sizeof header_file".

You can use some tricks on your code, but you should think on binary base instead of decimal base.

mSWITCHChars MACRO
xor tChar,01100000b
; .IF tChar == "@" ;64 ;1000000b
; mov tChar, " " ;32 ;_100000b
; .ELSEIF tChar == " " ;32
; mov tChar, "@" ;64
; .ENDIF
endm


Other trick is when you're working with multiples of number 2, so, 2,4,8,16,32,64,... . If you write this numbers with binary base you can feel an idea.
        1  1
      10  2
    100  4
  1000  8
10000  16
The number one is walking to left side and inserting zeros at right side, and this is multiplication. So, if you like to multiply something by 4 per example you need only shift the number left by 2, if you like to multiply by 8 you shl number by 3.
So, if walking to left is multiply, walking to right should be divide. So, if you like to divide something by 4 per example you need only shift the number right by 2, if you like to divide by 8 you shr number by 3. This way you get quotient.
But by dividing (shift right) we lost remainder, so, to you get the remainder you can use a mask on original number.
mov eax, ebx  ;ebx holds a number to be divided by 4
mov edx,ebx
shr eax,2 ;quotient, divide by 4
and edx,3 ;remainder, 3 means mask, 00000011b


You're working with sign numbers, so, if you get a negative value you can transform it to positive value by using "neg" instruction.

.IF tChar == "@" ;64; .if tChar == "@@@@"
; mov eax, offset Ln1
.while eax != 0
mov dword ptr [edi],"@@@@"
add edi,4 ;lea edi,[edi+4]
dec eax
.endw
.while edx != 0
mov byte ptr [edi],"@"
inc edi
dec edx
.endw
.ELSE;IF tChar == " " ;32
; mov eax,  offset Ln2
.while eax != 0
mov dword ptr [edi],"    "
add edi,4 ;lea edi,[edi+4]
dec eax
.endw
.while edx != 0
mov byte ptr [edi]," "
inc edi
dec edx
.endw

.ENDIF
I'd rather be this ambulant metamorphosis than to have that old opinion about everything

LordAdef

Master of the Golden Lands of Minas Gerais, Mr. Mineiro!!

Thanks my friend.
I´m having a lot of fun only dealing with asm without having to deal with Windows....

I´m prototyping these algos and making them to work first. Then, I´ll be tweaking and fiddling with it so to optimize it for speed.

I was having issues to time my algos, that´s why I wrote this thread. Without timing it properly I could not possibly judge anything. JJ has given me the timing tip above and now it´s fun time.

"NEG" => I have forgot about this one!!!! Thanks, I was wasting some valuable cycles there

SHR/SHL I´ve read somewhere these are kind of slower instructions. I guess there must be a cost relation and my extra code is in fact slower. I´ll check them

Please, keep it coming!
Cheers
Alex

jj2007

Hi Alex,

You should isolate the file load times from your algos. See the bookmarks in the attachment for an example.

In your case, it doesn't matter a lot because the algos have a lot of work to do, but for fast short algos it makes a big difference if you are just working on a buffer, or if the buffer needs to be filled from disk before the short work starts.

(in conditional assembly, ife means "if equal" or "if zero" - see free esi)

LordAdef

Dear JJ,

I´ll check this out now!
I must thank you, you can´t imagine how useful your timing snippet above is being right now! I was missing that.

I´ve already started to fiddle with the algo:
- mUNROLLdword  macro was slower so I got back mUNROLL and am optimizing it. 
- NEG was clearly faster than my maths
- SHR/SHL made a negative effect on the code, slower
- Working on optimizing  reg usage, with good results

So far:
Quote
New Algo: 103 ms       Old Algo: 97 ms
New Algo: 102 ms       Old Algo: 96 ms
New Algo: 102 ms       Old Algo: 97 ms
New Algo: 105 ms       Old Algo: 97 ms
New Algo: 108 ms       Old Algo: 100 ms
New Algo: 109 ms       Old Algo: 101 ms

after using "NEG"
   .mUNROLLdword was slower, using mUNROLL
   .SHR/SHL was slower

New Algo: 98 ms        Old Algo: 103 ms
New Algo: 98 ms        Old Algo: 102 ms
New Algo: 100 ms       Old Algo: 103 ms
New Algo: 98 ms        Old Algo: 102 ms
New Algo: 99 ms        Old Algo: 100 ms
New Algo: 99 ms        Old Algo: 101 ms

after optm tChar for ecx mUNROLL2

New Algo: 96 ms        Old Algo: 99 ms
New Algo: 94 ms        Old Algo: 104 ms
New Algo: 99 ms        Old Algo: 97 ms   x
New Algo: 92 ms        Old Algo: 98 ms
New Algo: 94 ms        Old Algo: 97 ms
New Algo: 92 ms        Old Algo: 100 ms
New Algo: 96 ms        Old Algo: 97 ms

mineiro

hello sir LordAdef;
I agree with you, some persons use 'lea' instead of 'shl'.
Check this link
http://www.agner.org/optimize/
I'd rather be this ambulant metamorphosis than to have that old opinion about everything

jj2007

Quote from: mineiro on March 26, 2017, 08:44:03 PMsome persons use 'lea' instead of 'shl'

Let's time it :biggrin:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
10      cycles for 100 * shl eax, 1
25      cycles for 100 * lea eax, [2*eax]
10      cycles for 100 * shl eax, 3
17      cycles for 100 * lea eax, [8*eax]
18      cycles for 100 * lea eax, [8*eax+100]
148     cycles for 100 * imul eax, eax, 8

11      cycles for 100 * shl eax, 1
26      cycles for 100 * lea eax, [2*eax]
11      cycles for 100 * shl eax, 3
16      cycles for 100 * lea eax, [8*eax]
17      cycles for 100 * lea eax, [8*eax+100]
147     cycles for 100 * imul eax, eax, 8

10      cycles for 100 * shl eax, 1
25      cycles for 100 * lea eax, [2*eax]
10      cycles for 100 * shl eax, 3
18      cycles for 100 * lea eax, [8*eax]
18      cycles for 100 * lea eax, [8*eax+100]
149     cycles for 100 * imul eax, eax, 8

11      cycles for 100 * shl eax, 1
25      cycles for 100 * lea eax, [2*eax]
10      cycles for 100 * shl eax, 3
16      cycles for 100 * lea eax, [8*eax]
18      cycles for 100 * lea eax, [8*eax+100]
148     cycles for 100 * imul eax, eax, 8

2       bytes for shl eax, 1
7       bytes for lea eax, [2*eax]
3       bytes for shl eax, 3
7       bytes for lea eax, [8*eax]
7       bytes for lea eax, [8*eax+100]
3       bytes for imul eax, eax, 8

mineiro

these are results on my pc:

Intel(R) Pentium(R) Dual  CPU  E2160  @ 1.80GHz (SSE4)

12      cycles for 100 * shl eax, 1
1       cycles for 100 * lea eax, [2*eax]
12      cycles for 100 * shl eax, 3
1       cycles for 100 * lea eax, [8*eax]
0       cycles for 100 * lea eax, [8*eax+100]
180     cycles for 100 * imul eax, eax, 8

87      cycles for 100 * shl eax, 1
1       cycles for 100 * lea eax, [2*eax]
12      cycles for 100 * shl eax, 3
1       cycles for 100 * lea eax, [8*eax]
0       cycles for 100 * lea eax, [8*eax+100]
181     cycles for 100 * imul eax, eax, 8

12      cycles for 100 * shl eax, 1
1       cycles for 100 * lea eax, [2*eax]
12      cycles for 100 * shl eax, 3
1       cycles for 100 * lea eax, [8*eax]
0       cycles for 100 * lea eax, [8*eax+100]
181     cycles for 100 * imul eax, eax, 8

12      cycles for 100 * shl eax, 1
1       cycles for 100 * lea eax, [2*eax]
12      cycles for 100 * shl eax, 3
1       cycles for 100 * lea eax, [8*eax]
0       cycles for 100 * lea eax, [8*eax+100]
180     cycles for 100 * imul eax, eax, 8

2       bytes for shl eax, 1
7       bytes for lea eax, [2*eax]
3       bytes for shl eax, 3
7       bytes for lea eax, [8*eax]
7       bytes for lea eax, [8*eax+100]
3       bytes for imul eax, eax, 8


--- o
I'd rather be this ambulant metamorphosis than to have that old opinion about everything

jj2007

0 cycles for 100 * lea is very strange ::)

LordAdef

these results are frustrating sometimes.....

sadly I left my pc already. I post my results later today

TWell

AMD E1-6010 APU with AMD Radeon R2 Graphics     (SSE4)

0       cycles for 100 * shl eax, 1
3       cycles for 100 * lea eax, [2*eax]
0       cycles for 100 * shl eax, 3
3       cycles for 100 * lea eax, [8*eax]
2       cycles for 100 * lea eax, [8*eax+100]
173     cycles for 100 * imul eax, eax, 8

1       cycles for 100 * shl eax, 1
4       cycles for 100 * lea eax, [2*eax]
0       cycles for 100 * shl eax, 3
3       cycles for 100 * lea eax, [8*eax]
2       cycles for 100 * lea eax, [8*eax+100]
174     cycles for 100 * imul eax, eax, 8

0       cycles for 100 * shl eax, 1
3       cycles for 100 * lea eax, [2*eax]
0       cycles for 100 * shl eax, 3
4       cycles for 100 * lea eax, [8*eax]
2       cycles for 100 * lea eax, [8*eax+100]
173     cycles for 100 * imul eax, eax, 8

0       cycles for 100 * shl eax, 1
5       cycles for 100 * lea eax, [2*eax]
0       cycles for 100 * shl eax, 3
2       cycles for 100 * lea eax, [8*eax]
2       cycles for 100 * lea eax, [8*eax+100]
172     cycles for 100 * imul eax, eax, 8

2       bytes for shl eax, 1
7       bytes for lea eax, [2*eax]
3       bytes for shl eax, 3
7       bytes for lea eax, [8*eax]
7       bytes for lea eax, [8*eax+100]
3       bytes for imul eax, eax, 8


--- ok ---

mineiro

#14
I tested on linux, I will just reboot and post results on windows.
I'll be back and edit this message.

Intel(R) Pentium(R) Dual  CPU  E2160  @ 1.80GHz (SSE4)

12      cycles for 100 * shl eax, 1
1       cycles for 100 * lea eax, [2*eax]
12      cycles for 100 * shl eax, 3
1       cycles for 100 * lea eax, [8*eax]
0       cycles for 100 * lea eax, [8*eax+100]
180     cycles for 100 * imul eax, eax, 8

12      cycles for 100 * shl eax, 1
1       cycles for 100 * lea eax, [2*eax]
12      cycles for 100 * shl eax, 3
1       cycles for 100 * lea eax, [8*eax]
0       cycles for 100 * lea eax, [8*eax+100]
180     cycles for 100 * imul eax, eax, 8

12      cycles for 100 * shl eax, 1
1       cycles for 100 * lea eax, [2*eax]
12      cycles for 100 * shl eax, 3
2       cycles for 100 * lea eax, [8*eax]
0       cycles for 100 * lea eax, [8*eax+100]
180     cycles for 100 * imul eax, eax, 8

12      cycles for 100 * shl eax, 1
1       cycles for 100 * lea eax, [2*eax]
12      cycles for 100 * shl eax, 3
1       cycles for 100 * lea eax, [8*eax]
0       cycles for 100 * lea eax, [8*eax+100]
180     cycles for 100 * imul eax, eax, 8

2       bytes for shl eax, 1
7       bytes for lea eax, [2*eax]
3       bytes for shl eax, 3
7       bytes for lea eax, [8*eax]
7       bytes for lea eax, [8*eax+100]
3       bytes for imul eax, eax, 8


--- ok ---

Same results on linux or windows.

---edited---
Other results
     Clock   Core cyc   Instruct       Uops  BrMispred
;-------------------------------------------------------
TestA call shl eax,1
      4131       4126       9000      11000          0
      4131       4126       9000      11000          0
         9          5          9         11          0
1000*shl eax,1
       990        996       1000       1000          0
100000*shl eax,1
    101151     101151     100000     106249          0
;-------------------------------------------------------
TestB call lea eax,[eax*2]
      7695       5126       9000      11000          0
      7686       5126       9000      11000          0
         9          5          9         11          0
1000*lea eax,[eax*2]
       999        995       1000       1000          0
100000*
    105813     105811     100000     112499          0
;-------------------------------------------------------
TestC call shl eax,3
      4248       4253       9000      11000          0
      6381       4253       9000      11000          0
         9          6          9         11          0
1000*shl eax,3
       999        999       1000       1000          0
100000*
    152973     101982     100000     109375          0
;-------------------------------------------------------
TestD call lea eax,[eax*8]
      5256       5252       9000      11000          0
      5256       5252       9000      11000          0
         9          6          9         11          0
1000*lea eax,[eax*8]
       999        996       1000       1000          0
100000*
    307638     307648     100000     125198          0
;-------------------------------------------------------
TestE call lea eax,[eax*8+100]
      5130       5126       9000      11000          0
      7686       5126       9000      11000          0
         0          5          9         11          0
1000*lea eax,[eax*8+100]
       999        996       1000       1000          0
100000*
    322011     214676     100000     125189          0
;-------------------------------------------------------
TestF call imul eax,eax,8
      4257       4253       9000      11000          0
      6372       4253       9000      11000          0
         9          6          9         11          0
1000*imul eax,eax,8
      2997       2995       1000       1000          0
100000*
    450054     300036     100000     109375          0
;--------------------------------------------------------
I'd rather be this ambulant metamorphosis than to have that old opinion about everything