The MASM Forum

General => The Laboratory => Topic started by: LordAdef on March 25, 2017, 04:24:49 PM

Title: Speed testing algos
Post by: LordAdef on March 25, 2017, 04:24:49 PM
Hi there, I usually post in my own thread in the Campus, where I´m making this game. But since this is related to timing algos, I thought this should be the right place.

The thing is, I realized I don´t know how to properly time my code to accurately tell if I´m optimizing it right.
There is always some system interference or some fluctuation. I´m sure there must be a way to accurately measure your code.

In the case below, there are 2 algos which I put in the same proggy. The "old" copies values byte per byte, while the 2nd invokes the memcopy to copy dword chunks first, and the remaing goes byte per byte.

I didn´t optmized it so far, and wanted to call for help before I begin. All I did is change the copy behaviour as described above.  :dazzled:

The old one uses this macro called mUNROLL:
QuotemUNROLL MACRO
         PUSH ecx
         mov ecx, tChar
         xor eax, eax
          @@:
            mov [edi], cl
              cmp eax, ebx
         jz @F          
             add edi, 1
         add eax, 1
         jmp @B
         @@:
         POP ecx
      endm

The "new" uses the mUNROLLdword:
QuotemUNROLLdword MACRO
      PUSH edx
      mov eax, ebx                        ; MOD here
      mov ecx, 4
      cdq
      div ecx                              ; result in eax, sobra in edx; MOD

      sub ebx, edx                        ; only dwords here, we do edx afterwards   
      .IF tChar == 64
         mov eax, offset Ln1
      .ELSEIF tChar == 32
         mov eax,  offset Ln2
      .ENDIF
      invoke MemCopy, eax, edi, ebx      ; copying dword sizes
      add edi, ebx                        ; add a line lenght for edi index
      mov bl, [eax]                        ; this is either @ or space
      xor ecx, ecx
   @@:
      cmp ecx, edx
      jz @F
      mov [edi], bl
      add edi, 1
      add ecx, 1          
      jmp @B
   @@:         
      POP edx
      endm

I´m calling both algos from here:
Quote.code
start:
   ;old algo. Copies byte by byte
   printf ("\nNew Algo: ")
   ;NanoTimer()
   invoke pDecompress, offset NewFile
   ;Print Str$("%i us ", NanoTimer(us))
      
   ;New algo. Copies dwords, then the remaining byte by byte. Uses "memcopy"
   printf ("\nOld Algo: ")
   ;NanoTimer()
   invoke pDecompress2, offset NewFile
   ;Print Str$("%i us ", NanoTimer(us))

I tested multiplying the calls by 4 each, 8 and 16. I also inverted the invoking order, which appeared to change things too.
Quote1 call
old   174   197   192   154   268   227   211   212   157   169   162   219   =2
new   192   161!   157!   153x   187!   173!   158!   200!   165   141!   158!   157!   =9

1 call (New 1st)
new   178   212   186   183   170x   153   196   168   169   193   192   =4
old   140!   250   162!   172!   171x   192   253   147!   162!   135!   202   =6

4 calls:
old   515   616   512   588   740   480   518   503   543   532   =4
new   499!   556!   535   558!   735!   678   481!   551   511!   557   =6

8 calls:
old   995   923   937   1096   1083   914   999   886   934   905   973   =3
new   904!   900!   1100   906!   961!   931   902!   902!   1107   894!   917!   =8

16 calls:
old   1799   1783   2352   1907   3213   1845x   1688   1727x   1844   1732   1717   =4
new   1702!   1730!   1763!   1896!   2843!   1849x   1800   1722x   2030   1826   1756   =5

16 calls (New first):   
new   1913   1784   1969   2162   1837   1900   1728   2630   2165   1729   1763   =0
old   1732!   1658!   1778!   1902!   1733!   1688!   1615!   2029!   1638!   1669!   1649!   =11

If you look at my last test for 16 calls, my New algo looses in every take.
Title: Re: Speed testing algos
Post by: jj2007 on March 25, 2017, 06:56:05 PM
Use a 1000 calls, as shown below:

      printf ("\nNew Algo: ")
      NanoTimer()
      push 999
      .Repeat
            invoke pDecompress, offset NewFile
            dec stack
      .Until Sign?
      pop edx
      Print Str$("%i ms ", NanoTimer(ms))
           
      ;New algo. Copies dwords, then the remaining byte by byte. Uses "memcopy"
      printf ("\tOld Algo: ")
      NanoTimer()
      push 999
      .Repeat
            invoke pDecompress2, offset NewFile
            dec stack
      .Until Sign?
      pop edx
      Print Str$("%i ms ", NanoTimer(ms))
      endm


New Algo: 129 ms        Old Algo: 132 ms
New Algo: 123 ms        Old Algo: 128 ms
New Algo: 126 ms        Old Algo: 123 ms
New Algo: 127 ms        Old Algo: 118 ms
New Algo: 117 ms        Old Algo: 128 ms
New Algo: 129 ms        Old Algo: 138 ms
New Algo: 123 ms        Old Algo: 124 ms
New Algo: 123 ms        Old Algo: 122 ms
New Algo: 123 ms        Old Algo: 124 ms
New Algo: 120 ms        Old Algo: 122 ms


Results in the 100 ms range are much more reliable than something in the us range.
The timer example in the templates is optimised for such problems btw.
Title: Re: Speed testing algos
Post by: LordAdef on March 25, 2017, 07:38:00 PM
Nice JJ!

dec "stack" is new to me by the way. MB?
Title: Re: Speed testing algos
Post by: jj2007 on March 25, 2017, 07:50:39 PM
Quote from: LordAdef on March 25, 2017, 07:38:00 PM
dec "stack" is new to me by the way. MB?

Yes, it's just an equate: stack equ dword ptr [esp]
      push 999
      .Repeat
            invoke pDecompress, offset NewFile
            dec stack   ; same as dec dword ptr [esp]
      .Until Sign?
      pop edx


It will cost an extra cycle in that loop, but that hardly matters, and it's a very simple way to get a counter without using a register. And no risk to end up by accident with an endless loop - I avoid .Until Zero? wherever possible ;)
Title: Re: Speed testing algos
Post by: mineiro on March 26, 2017, 07:30:38 AM
hello sir LordAdef;
I took some time to understand your code, sound to me that you're thinking so hard, you can do this procedure on a more easy way.

To better readable masm have 'struct' keyword, acts like opaque structures.
header_file struct
output_size dd ?
line_count dd ?
line_lenght dd ?
header_file ends

so, on your code you can use do something like:

mov   esi, pstring ;pointer to file
assume esi:ptr header_file
m2m tSum,[esi].output_size
m2m tLine,[esi].line_count
m2m tLineLen,[esi].line_lenght
assume esi:nothing
; add esi,sizeof header_file ;displacement header file
; sub szstring,sizeof header_file ;szstring==szstring-header_file

On your code you have used "mov ecx, DWORD Ptr [esi+8]" to get line_lenght . On line "mov edx, 12 ;counter" you can write "mov edx,sizeof header_file".

You can use some tricks on your code, but you should think on binary base instead of decimal base.

mSWITCHChars MACRO
xor tChar,01100000b
; .IF tChar == "@" ;64 ;1000000b
; mov tChar, " " ;32 ;_100000b
; .ELSEIF tChar == " " ;32
; mov tChar, "@" ;64
; .ENDIF
endm


Other trick is when you're working with multiples of number 2, so, 2,4,8,16,32,64,... . If you write this numbers with binary base you can feel an idea.
        1  1
      10  2
    100  4
  1000  8
10000  16
The number one is walking to left side and inserting zeros at right side, and this is multiplication. So, if you like to multiply something by 4 per example you need only shift the number left by 2, if you like to multiply by 8 you shl number by 3.
So, if walking to left is multiply, walking to right should be divide. So, if you like to divide something by 4 per example you need only shift the number right by 2, if you like to divide by 8 you shr number by 3. This way you get quotient.
But by dividing (shift right) we lost remainder, so, to you get the remainder you can use a mask on original number.
mov eax, ebx  ;ebx holds a number to be divided by 4
mov edx,ebx
shr eax,2 ;quotient, divide by 4
and edx,3 ;remainder, 3 means mask, 00000011b


You're working with sign numbers, so, if you get a negative value you can transform it to positive value by using "neg" instruction.

.IF tChar == "@" ;64; .if tChar == "@@@@"
; mov eax, offset Ln1
.while eax != 0
mov dword ptr [edi],"@@@@"
add edi,4 ;lea edi,[edi+4]
dec eax
.endw
.while edx != 0
mov byte ptr [edi],"@"
inc edi
dec edx
.endw
.ELSE;IF tChar == " " ;32
; mov eax,  offset Ln2
.while eax != 0
mov dword ptr [edi],"    "
add edi,4 ;lea edi,[edi+4]
dec eax
.endw
.while edx != 0
mov byte ptr [edi]," "
inc edi
dec edx
.endw

.ENDIF
Title: Re: Speed testing algos
Post by: LordAdef on March 26, 2017, 06:48:51 PM
Master of the Golden Lands of Minas Gerais, Mr. Mineiro!!

Thanks my friend.
I´m having a lot of fun only dealing with asm without having to deal with Windows....

I´m prototyping these algos and making them to work first. Then, I´ll be tweaking and fiddling with it so to optimize it for speed.

I was having issues to time my algos, that´s why I wrote this thread. Without timing it properly I could not possibly judge anything. JJ has given me the timing tip above and now it´s fun time.

"NEG" => I have forgot about this one!!!! Thanks, I was wasting some valuable cycles there

SHR/SHL I´ve read somewhere these are kind of slower instructions. I guess there must be a cost relation and my extra code is in fact slower. I´ll check them

Please, keep it coming!
Cheers
Alex
Title: Re: Speed testing algos
Post by: jj2007 on March 26, 2017, 07:55:00 PM
Hi Alex,

You should isolate the file load times from your algos. See the bookmarks in the attachment for an example.

In your case, it doesn't matter a lot because the algos have a lot of work to do, but for fast short algos it makes a big difference if you are just working on a buffer, or if the buffer needs to be filled from disk before the short work starts.

(in conditional assembly, ife means "if equal" or "if zero" - see free esi)
Title: Re: Speed testing algos
Post by: LordAdef on March 26, 2017, 08:24:56 PM
Dear JJ,

I´ll check this out now!
I must thank you, you can´t imagine how useful your timing snippet above is being right now! I was missing that.

I´ve already started to fiddle with the algo:
- mUNROLLdword  macro was slower so I got back mUNROLL and am optimizing it. 
- NEG was clearly faster than my maths
- SHR/SHL made a negative effect on the code, slower
- Working on optimizing  reg usage, with good results

So far:
Quote
New Algo: 103 ms       Old Algo: 97 ms
New Algo: 102 ms       Old Algo: 96 ms
New Algo: 102 ms       Old Algo: 97 ms
New Algo: 105 ms       Old Algo: 97 ms
New Algo: 108 ms       Old Algo: 100 ms
New Algo: 109 ms       Old Algo: 101 ms

after using "NEG"
   .mUNROLLdword was slower, using mUNROLL
   .SHR/SHL was slower

New Algo: 98 ms        Old Algo: 103 ms
New Algo: 98 ms        Old Algo: 102 ms
New Algo: 100 ms       Old Algo: 103 ms
New Algo: 98 ms        Old Algo: 102 ms
New Algo: 99 ms        Old Algo: 100 ms
New Algo: 99 ms        Old Algo: 101 ms

after optm tChar for ecx mUNROLL2

New Algo: 96 ms        Old Algo: 99 ms
New Algo: 94 ms        Old Algo: 104 ms
New Algo: 99 ms        Old Algo: 97 ms   x
New Algo: 92 ms        Old Algo: 98 ms
New Algo: 94 ms        Old Algo: 97 ms
New Algo: 92 ms        Old Algo: 100 ms
New Algo: 96 ms        Old Algo: 97 ms
Title: Re: Speed testing algos
Post by: mineiro on March 26, 2017, 08:44:03 PM
hello sir LordAdef;
I agree with you, some persons use 'lea' instead of 'shl'.
Check this link
http://www.agner.org/optimize/
Title: Re: Speed testing algos
Post by: jj2007 on March 26, 2017, 09:17:05 PM
Quote from: mineiro on March 26, 2017, 08:44:03 PMsome persons use 'lea' instead of 'shl'

Let's time it :biggrin:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
10      cycles for 100 * shl eax, 1
25      cycles for 100 * lea eax, [2*eax]
10      cycles for 100 * shl eax, 3
17      cycles for 100 * lea eax, [8*eax]
18      cycles for 100 * lea eax, [8*eax+100]
148     cycles for 100 * imul eax, eax, 8

11      cycles for 100 * shl eax, 1
26      cycles for 100 * lea eax, [2*eax]
11      cycles for 100 * shl eax, 3
16      cycles for 100 * lea eax, [8*eax]
17      cycles for 100 * lea eax, [8*eax+100]
147     cycles for 100 * imul eax, eax, 8

10      cycles for 100 * shl eax, 1
25      cycles for 100 * lea eax, [2*eax]
10      cycles for 100 * shl eax, 3
18      cycles for 100 * lea eax, [8*eax]
18      cycles for 100 * lea eax, [8*eax+100]
149     cycles for 100 * imul eax, eax, 8

11      cycles for 100 * shl eax, 1
25      cycles for 100 * lea eax, [2*eax]
10      cycles for 100 * shl eax, 3
16      cycles for 100 * lea eax, [8*eax]
18      cycles for 100 * lea eax, [8*eax+100]
148     cycles for 100 * imul eax, eax, 8

2       bytes for shl eax, 1
7       bytes for lea eax, [2*eax]
3       bytes for shl eax, 3
7       bytes for lea eax, [8*eax]
7       bytes for lea eax, [8*eax+100]
3       bytes for imul eax, eax, 8
Title: Re: Speed testing algos
Post by: mineiro on March 26, 2017, 09:23:04 PM
these are results on my pc:

Intel(R) Pentium(R) Dual  CPU  E2160  @ 1.80GHz (SSE4)

12      cycles for 100 * shl eax, 1
1       cycles for 100 * lea eax, [2*eax]
12      cycles for 100 * shl eax, 3
1       cycles for 100 * lea eax, [8*eax]
0       cycles for 100 * lea eax, [8*eax+100]
180     cycles for 100 * imul eax, eax, 8

87      cycles for 100 * shl eax, 1
1       cycles for 100 * lea eax, [2*eax]
12      cycles for 100 * shl eax, 3
1       cycles for 100 * lea eax, [8*eax]
0       cycles for 100 * lea eax, [8*eax+100]
181     cycles for 100 * imul eax, eax, 8

12      cycles for 100 * shl eax, 1
1       cycles for 100 * lea eax, [2*eax]
12      cycles for 100 * shl eax, 3
1       cycles for 100 * lea eax, [8*eax]
0       cycles for 100 * lea eax, [8*eax+100]
181     cycles for 100 * imul eax, eax, 8

12      cycles for 100 * shl eax, 1
1       cycles for 100 * lea eax, [2*eax]
12      cycles for 100 * shl eax, 3
1       cycles for 100 * lea eax, [8*eax]
0       cycles for 100 * lea eax, [8*eax+100]
180     cycles for 100 * imul eax, eax, 8

2       bytes for shl eax, 1
7       bytes for lea eax, [2*eax]
3       bytes for shl eax, 3
7       bytes for lea eax, [8*eax]
7       bytes for lea eax, [8*eax+100]
3       bytes for imul eax, eax, 8


--- o
Title: Re: Speed testing algos
Post by: jj2007 on March 26, 2017, 09:35:52 PM
0 cycles for 100 * lea is very strange ::)
Title: Re: Speed testing algos
Post by: LordAdef on March 26, 2017, 09:37:42 PM
these results are frustrating sometimes.....

sadly I left my pc already. I post my results later today
Title: Re: Speed testing algos
Post by: TWell on March 26, 2017, 09:40:23 PM
AMD E1-6010 APU with AMD Radeon R2 Graphics     (SSE4)

0       cycles for 100 * shl eax, 1
3       cycles for 100 * lea eax, [2*eax]
0       cycles for 100 * shl eax, 3
3       cycles for 100 * lea eax, [8*eax]
2       cycles for 100 * lea eax, [8*eax+100]
173     cycles for 100 * imul eax, eax, 8

1       cycles for 100 * shl eax, 1
4       cycles for 100 * lea eax, [2*eax]
0       cycles for 100 * shl eax, 3
3       cycles for 100 * lea eax, [8*eax]
2       cycles for 100 * lea eax, [8*eax+100]
174     cycles for 100 * imul eax, eax, 8

0       cycles for 100 * shl eax, 1
3       cycles for 100 * lea eax, [2*eax]
0       cycles for 100 * shl eax, 3
4       cycles for 100 * lea eax, [8*eax]
2       cycles for 100 * lea eax, [8*eax+100]
173     cycles for 100 * imul eax, eax, 8

0       cycles for 100 * shl eax, 1
5       cycles for 100 * lea eax, [2*eax]
0       cycles for 100 * shl eax, 3
2       cycles for 100 * lea eax, [8*eax]
2       cycles for 100 * lea eax, [8*eax+100]
172     cycles for 100 * imul eax, eax, 8

2       bytes for shl eax, 1
7       bytes for lea eax, [2*eax]
3       bytes for shl eax, 3
7       bytes for lea eax, [8*eax]
7       bytes for lea eax, [8*eax+100]
3       bytes for imul eax, eax, 8


--- ok ---
Title: Re: Speed testing algos
Post by: mineiro on March 26, 2017, 09:43:13 PM
I tested on linux, I will just reboot and post results on windows.
I'll be back and edit this message.

Intel(R) Pentium(R) Dual  CPU  E2160  @ 1.80GHz (SSE4)

12      cycles for 100 * shl eax, 1
1       cycles for 100 * lea eax, [2*eax]
12      cycles for 100 * shl eax, 3
1       cycles for 100 * lea eax, [8*eax]
0       cycles for 100 * lea eax, [8*eax+100]
180     cycles for 100 * imul eax, eax, 8

12      cycles for 100 * shl eax, 1
1       cycles for 100 * lea eax, [2*eax]
12      cycles for 100 * shl eax, 3
1       cycles for 100 * lea eax, [8*eax]
0       cycles for 100 * lea eax, [8*eax+100]
180     cycles for 100 * imul eax, eax, 8

12      cycles for 100 * shl eax, 1
1       cycles for 100 * lea eax, [2*eax]
12      cycles for 100 * shl eax, 3
2       cycles for 100 * lea eax, [8*eax]
0       cycles for 100 * lea eax, [8*eax+100]
180     cycles for 100 * imul eax, eax, 8

12      cycles for 100 * shl eax, 1
1       cycles for 100 * lea eax, [2*eax]
12      cycles for 100 * shl eax, 3
1       cycles for 100 * lea eax, [8*eax]
0       cycles for 100 * lea eax, [8*eax+100]
180     cycles for 100 * imul eax, eax, 8

2       bytes for shl eax, 1
7       bytes for lea eax, [2*eax]
3       bytes for shl eax, 3
7       bytes for lea eax, [8*eax]
7       bytes for lea eax, [8*eax+100]
3       bytes for imul eax, eax, 8


--- ok ---

Same results on linux or windows.

---edited---
Other results
     Clock   Core cyc   Instruct       Uops  BrMispred
;-------------------------------------------------------
TestA call shl eax,1
      4131       4126       9000      11000          0
      4131       4126       9000      11000          0
         9          5          9         11          0
1000*shl eax,1
       990        996       1000       1000          0
100000*shl eax,1
    101151     101151     100000     106249          0
;-------------------------------------------------------
TestB call lea eax,[eax*2]
      7695       5126       9000      11000          0
      7686       5126       9000      11000          0
         9          5          9         11          0
1000*lea eax,[eax*2]
       999        995       1000       1000          0
100000*
    105813     105811     100000     112499          0
;-------------------------------------------------------
TestC call shl eax,3
      4248       4253       9000      11000          0
      6381       4253       9000      11000          0
         9          6          9         11          0
1000*shl eax,3
       999        999       1000       1000          0
100000*
    152973     101982     100000     109375          0
;-------------------------------------------------------
TestD call lea eax,[eax*8]
      5256       5252       9000      11000          0
      5256       5252       9000      11000          0
         9          6          9         11          0
1000*lea eax,[eax*8]
       999        996       1000       1000          0
100000*
    307638     307648     100000     125198          0
;-------------------------------------------------------
TestE call lea eax,[eax*8+100]
      5130       5126       9000      11000          0
      7686       5126       9000      11000          0
         0          5          9         11          0
1000*lea eax,[eax*8+100]
       999        996       1000       1000          0
100000*
    322011     214676     100000     125189          0
;-------------------------------------------------------
TestF call imul eax,eax,8
      4257       4253       9000      11000          0
      6372       4253       9000      11000          0
         9          6          9         11          0
1000*imul eax,eax,8
      2997       2995       1000       1000          0
100000*
    450054     300036     100000     109375          0
;--------------------------------------------------------
Title: Re: Speed testing algos
Post by: LordAdef on March 27, 2017, 04:08:54 PM
Thanks to Mineiro´s tips, I managed to squeeze some good ms from the algo.

I took his dword example above but changed with a little trick I think saved me some time. Instead of doing the dword copy loop and then the byte to byte for the remainders, I rolled the dword loop 1 step beyond my target. Afterwards, I set my ptr index back. Although I will be rewriting 1 or 2 bytes, I got rid of a lot of code:

Quote
      mUNROLL3 MACRO   ;NEW
         ; copy dword chunks + 1 dword
         ; edi is brought back overwritting the extra bytes
      xor eax, eax
      .IF ecx == 64
         @@:
         cmp eax, ebx
         ja @F                           ; it goes beyond ebx
         mov dword ptr [edi+eax],"@@@@"
         add eax, 4
         jmp @B
        @@:
         add edi, ebx   

The history of my timings step by step (resumed, full list in the zip):
Quote
New Algo: 103 ms       Old Algo: 97 ms
New Algo: 102 ms       Old Algo: 96 ms
New Algo: 102 ms       Old Algo: 97 ms

after using "NEG":
   .mUNROLLdword was slower, using mUNROLL
   .SHR/SHL was slower

New Algo: 98 ms        Old Algo: 103 ms
New Algo: 98 ms        Old Algo: 102 ms
New Algo: 98 ms        Old Algo: 102 ms

after optm tChar for ecx mUNROLL2:

New Algo: 96 ms        Old Algo: 99 ms
New Algo: 94 ms        Old Algo: 104 ms
New Algo: 92 ms        Old Algo: 98 ms

after exchanging tChar for ecx:
New Algo: 91 ms        Old Algo: 97 ms
New Algo: 93 ms        Old Algo:134 ms
New Algo: 93 ms        Old Algo: 97 ms

after xor tChar, 01100000b
New Algo: 91 ms        Old Algo: 98 ms
New Algo: 90 ms        Old Algo: 97 ms
New Algo: 91 ms        Old Algo: 98 ms

after using the adapted Mineiro´s dword copy code
   .got rid of tChar local. Using ECX
New Algo: 65 ms        Old Algo: 102 ms
New Algo: 65 ms        Old Algo: 98 ms
New Algo: 66 ms        Old Algo: 102 ms

after making LineLenght an EQU
New Algo: 65 ms        Old Algo: 96 ms
New Algo: 64 ms        Old Algo: 96 ms
New Algo: 64 ms        Old Algo: 97 ms

I left old code commented for comparison reasons.

Do you guys see any place I can still improve within this algo design?
Title: Re: Speed testing algos
Post by: jj2007 on March 27, 2017, 06:41:38 PM
What about this: mov edx, 12-1 ;counter
xor ebx, ebx
bak:
inc edx
@Main:
cmp edx, tLen
jz DoneDC ; DONE?
mov bl, SBYTE Ptr [esi+edx]
test bl, bl
js Negative
jz Zero
; ------------------------- It´s Positive:
add tLineLen, 1 ; inc line num
add tSum, ebx ; Sums line (to recouver missing chunk)
mUNROLL3 ;x ; MUST HAVE ecx == 64 or 32
xor ecx, 01100000b ;x
jmp bak
;       bak:
; add edx, 1
; jmp @Main
Title: Re: Speed testing algos
Post by: mineiro on March 28, 2017, 02:32:03 AM
hello sir LordAdef;

Ecx holds value to be stored
mUNROLL3 MACRO ;NEW
; copy dword chunks + 1 dword
; edi is brought back overwritting the extra bytes

xor eax, eax
; .IF ecx == "@@@@" ;64
   @@:
cmp eax, ebx
ja @F ; it goes beyond ebx
mov dword ptr [edi+eax],ecx ;"@@@@"
add eax, 4
jmp @B
  @@:
add edi, ebx
; .ELSE
;    @@:
; cmp eax, ebx ; lenght of line
; ja @F
; mov dword ptr [edi+eax],"    "
; add eax, 4
; jmp @B
;   @@:
; add edi, ebx
; .ENDIF
endm


You now put "@" or space into a register, so why not expand that to "@@@@" or "    "?
pDecompressNEW proc, string:DWORD

      LOCAL  tLine:DWORD
      LOCAL  tLineLen :DWORD
      LOCAL  tSum:DWORD

      LOCAL  tLen :DWORD
      LOCAL  tChar:DWORD ; 64@ or 32" " ; NOT USING TCHAR -> ECX

mov tLine, 0
mov tLineLen, 0
mov tSum, 0
   
    LineLenght = 160 ; Lenght of line is fixed to 160
   
    mov   edi, offset Decomp
mov   esi, InputFile(string)
    mov   tLen, ecx
mov ecx,"@@@@" ;<-------------
mov edx, 12
xor ebx, ebx
@Main:
cmp edx, tLen
jz DoneDC

mov bl, SBYTE Ptr [esi+edx] ;carrego um byte que contém um contador
test bl, bl ;se for zero ou negativo
js Negative
jz Zero
; ------------------------- ItŽs Positive:
add tLineLen, 1
add tSum, ebx
mUNROLL3
xor ecx, 01100000011000000110000001100000b ;<------------ permute 4 @@@@ to 4 spaces or vice-versa
bak:
add edx, 1
jmp @Main

Negative:
  add tLineLen, 1
neg bl
mov ecx, "    "  ;<-----------------
add tSum, ebx
mUNROLL3
; ----------- Missing chunk -----------
mov ebx, LineLenght ;x
xor ecx, 01100000011000000110000001100000b ;<------------ permute 4 @@@@ to 4 spaces or vice-versa
sub ebx, tSum ; Missing chunk value

mUNROLL3 ;x ; copy macro, byte per byte
mCHECKnext ; ecx is still the val of tChar. "mov tChar, ecx" here at this macro
  add tLine, 1
jmp bak

  Zero: ; Copy last line from edi itself. Uses MemCopy, dword.
mov eax, edi
sub eax, LineLenght

invoke MemCopy, eax, edi, LineLenght
add edi, LineLenght

mCHECKnext ;check if next val is zero -> makes LineLen & Sum = 0
add tLine, 1
mov ecx, "@@@@" ;<----------
jmp bak

DoneDC:
free esi
xor edi, edi ;edi is index 
ret
pDecompressNEW endp
Title: Re: Speed testing algos
Post by: LordAdef on March 28, 2017, 03:11:26 AM
hey Mineiro,

you are right, it's so obvious and still it didn't click me. in fact, with that approach, I can even loose that conditional .IF ecx==   since ecx already holds the value we need. We just need to copy ecx straight away.
I can't wait to time this and see if that saves 1 or 2 extra ms.

JJ idea above (jmp bak) is also very good! it saves 1 jump for every value. Clean and sutil.
Title: Re: Speed testing algos
Post by: LordAdef on March 28, 2017, 03:55:04 AM
Quoteafter making LineLenght an EQU
New Algo: 65 ms        Old Algo: 96 ms
New Algo: 65 ms        Old Algo: 98 ms
New Algo: 65 ms        Old Algo: 96 ms
New Algo: 64 ms        Old Algo: 96 ms
New Algo: 64 ms        Old Algo: 97 ms

after ecx = "@@@@", and bak loop adjusted
New Algo: 63 ms        Old Algo: 99 ms
New Algo: 62 ms        Old Algo: 100 ms
New Algo: 61 ms        Old Algo: 99 ms
New Algo: 62 ms        Old Algo: 98 ms
New Algo: 62 ms        Old Algo: 99 ms
New Algo: 61 ms        Old Algo: 99 ms
New Algo: 61 ms        Old Algo: 98 ms
New Algo: 61 ms        Old Algo: 99 ms

The latest moves gave it a boost of about 2-3 ms. Nice!
Title: Re: Speed testing algos
Post by: mineiro on March 28, 2017, 04:09:12 AM
I think that if you describe what your code should do we can help more.
I don't get a clear idea about what your code is doing, well, what negative,positive or zero piece codes should do.
If you describe so I can create a code and we can measure this function.

---edited----
something like, you're dealing with a map that have 2 chars (@ or space), so, you can fullfill a map with most used char, I'm supposing "@",so, the code only need draw(overwrite) river (space).
Have a lot of approaches to be tried.
Title: Re: Speed testing algos
Post by: LordAdef on March 28, 2017, 05:10:56 AM
Sorry Mineiro,

This is related to the game I am developing in the other thread. this algo reads back my RLE map backgrounf file. Since you posted there I thought you were in the loop:

http://masm32.com/board/index.php?topic=5962.msg64373#msg64373 (http://masm32.com/board/index.php?topic=5962.msg64373#msg64373)

This is what it is, with an example. the game is a River Raid recreation.
Title: Re: Speed testing algos
Post by: mineiro on March 28, 2017, 08:37:22 AM
oh yes, my fault, I'm reading that code now.
Title: Re: Speed testing algos
Post by: mineiro on March 29, 2017, 01:34:24 AM
Hello sir LordAdef;
Is this description right or wrong?

compressed file.dat holds a structure with 3 dword members (original size, line count and line lenght) followed by byte counters

char = @ || space
while byte counters != end of file do
if counter < 0, neg counter,count sum, store N chars, permute char, store char (160-count sum) times
if counter > 0, store N chars, permute char, count sum
if counter = 0, copy last 160 uncompressed chars (repeat last line)

Title: Re: Speed testing algos
Post by: LordAdef on March 29, 2017, 11:15:09 AM
Hi Mineiro,

That´s the idea.
Visually, it goes like this:

@@@@@@@@@@     @@@     @@@@@@@@@@ -> this is a line. Now, my map is set to 160 chars
            10               5    3      5               x             

when compressing, I loose the last chunk (mark as x) since I can put it back on decomp. I neg the chunk before the last, to mark it as my "end of line".

So, This line will be compressed like this:   "10, 5, 3, -5"
Similar lines are mark as "0"
Like:
@@@@@@@@@@     @@@     @@@@@@@@@@
@@@@@@@@@@     @@@     @@@@@@@@@@
@@@@@@@@@@     @@@     @@@@@@@@@@

= 10, 5, 3, -5, 0, 0

you got it!
Abraços!
Alex