Hi there, I usually post in my own thread in the Campus, where I´m making this game. But since this is related to timing algos, I thought this should be the right place.
The thing is, I realized I don´t know how to properly time my code to accurately tell if I´m optimizing it right.
There is always some system interference or some fluctuation. I´m sure there must be a way to accurately measure your code.
In the case below, there are 2 algos which I put in the same proggy. The "old" copies values byte per byte, while the 2nd invokes the memcopy to copy dword chunks first, and the remaing goes byte per byte.
I didn´t optmized it so far, and wanted to call for help before I begin. All I did is change the copy behaviour as described above. :dazzled:
The old one uses this macro called mUNROLL:
QuotemUNROLL MACRO
PUSH ecx
mov ecx, tChar
xor eax, eax
@@:
mov [edi], cl
cmp eax, ebx
jz @F
add edi, 1
add eax, 1
jmp @B
@@:
POP ecx
endm
The "new" uses the mUNROLLdword:
QuotemUNROLLdword MACRO
PUSH edx
mov eax, ebx ; MOD here
mov ecx, 4
cdq
div ecx ; result in eax, sobra in edx; MOD
sub ebx, edx ; only dwords here, we do edx afterwards
.IF tChar == 64
mov eax, offset Ln1
.ELSEIF tChar == 32
mov eax, offset Ln2
.ENDIF
invoke MemCopy, eax, edi, ebx ; copying dword sizes
add edi, ebx ; add a line lenght for edi index
mov bl, [eax] ; this is either @ or space
xor ecx, ecx
@@:
cmp ecx, edx
jz @F
mov [edi], bl
add edi, 1
add ecx, 1
jmp @B
@@:
POP edx
endm
I´m calling both algos from here:
Quote.code
start:
;old algo. Copies byte by byte
printf ("\nNew Algo: ")
;NanoTimer()
invoke pDecompress, offset NewFile
;Print Str$("%i us ", NanoTimer(us))
;New algo. Copies dwords, then the remaining byte by byte. Uses "memcopy"
printf ("\nOld Algo: ")
;NanoTimer()
invoke pDecompress2, offset NewFile
;Print Str$("%i us ", NanoTimer(us))
I tested multiplying the calls by 4 each, 8 and 16. I also inverted the invoking order, which appeared to change things too.
Quote1 call
old 174 197 192 154 268 227 211 212 157 169 162 219 =2
new 192 161! 157! 153x 187! 173! 158! 200! 165 141! 158! 157! =9
1 call (New 1st)
new 178 212 186 183 170x 153 196 168 169 193 192 =4
old 140! 250 162! 172! 171x 192 253 147! 162! 135! 202 =6
4 calls:
old 515 616 512 588 740 480 518 503 543 532 =4
new 499! 556! 535 558! 735! 678 481! 551 511! 557 =6
8 calls:
old 995 923 937 1096 1083 914 999 886 934 905 973 =3
new 904! 900! 1100 906! 961! 931 902! 902! 1107 894! 917! =8
16 calls:
old 1799 1783 2352 1907 3213 1845x 1688 1727x 1844 1732 1717 =4
new 1702! 1730! 1763! 1896! 2843! 1849x 1800 1722x 2030 1826 1756 =5
16 calls (New first):
new 1913 1784 1969 2162 1837 1900 1728 2630 2165 1729 1763 =0
old 1732! 1658! 1778! 1902! 1733! 1688! 1615! 2029! 1638! 1669! 1649! =11
If you look at my last test for 16 calls, my New algo looses in every take.
Use a 1000 calls, as shown below:
printf ("\nNew Algo: ")
NanoTimer()
push 999
.Repeat
invoke pDecompress, offset NewFile
dec stack
.Until Sign?
pop edx
Print Str$("%i ms ", NanoTimer(ms))
;New algo. Copies dwords, then the remaining byte by byte. Uses "memcopy"
printf ("\tOld Algo: ")
NanoTimer()
push 999
.Repeat
invoke pDecompress2, offset NewFile
dec stack
.Until Sign?
pop edx
Print Str$("%i ms ", NanoTimer(ms))
endm
New Algo: 129 ms Old Algo: 132 ms
New Algo: 123 ms Old Algo: 128 ms
New Algo: 126 ms Old Algo: 123 ms
New Algo: 127 ms Old Algo: 118 ms
New Algo: 117 ms Old Algo: 128 ms
New Algo: 129 ms Old Algo: 138 ms
New Algo: 123 ms Old Algo: 124 ms
New Algo: 123 ms Old Algo: 122 ms
New Algo: 123 ms Old Algo: 124 ms
New Algo: 120 ms Old Algo: 122 ms
Results in the 100 ms range are much more reliable than something in the us range.
The timer example in the templates is optimised for such problems btw.
Nice JJ!
dec "stack" is new to me by the way. MB?
Quote from: LordAdef on March 25, 2017, 07:38:00 PM
dec "stack" is new to me by the way. MB?
Yes, it's just an equate: stack equ dword ptr [esp]
push 999
.Repeat
invoke pDecompress, offset NewFile
dec stack ; same as dec dword ptr [esp]
.Until Sign?
pop edx
It will cost an extra cycle in that loop, but that hardly matters, and it's a very simple way to get a counter without using a register. And no risk to end up by accident with an endless loop - I avoid
.Until Zero? wherever possible ;)
hello sir LordAdef;
I took some time to understand your code, sound to me that you're thinking so hard, you can do this procedure on a more easy way.
To better readable masm have 'struct' keyword, acts like opaque structures.
header_file struct
output_size dd ?
line_count dd ?
line_lenght dd ?
header_file ends
so, on your code you can use do something like:
mov esi, pstring ;pointer to file
assume esi:ptr header_file
m2m tSum,[esi].output_size
m2m tLine,[esi].line_count
m2m tLineLen,[esi].line_lenght
assume esi:nothing
; add esi,sizeof header_file ;displacement header file
; sub szstring,sizeof header_file ;szstring==szstring-header_file
On your code you have used "mov ecx, DWORD Ptr [esi+8]" to get line_lenght . On line "mov edx, 12 ;counter" you can write "mov edx,sizeof header_file".
You can use some tricks on your code, but you should think on binary base instead of decimal base.
mSWITCHChars MACRO
xor tChar,01100000b
; .IF tChar == "@" ;64 ;1000000b
; mov tChar, " " ;32 ;_100000b
; .ELSEIF tChar == " " ;32
; mov tChar, "@" ;64
; .ENDIF
endm
Other trick is when you're working with multiples of number 2, so, 2,4,8,16,32,64,... . If you write this numbers with binary base you can feel an idea.
1 1
10 2
100 4
1000 8
10000 16
The number one is walking to left side and inserting zeros at right side, and this is multiplication. So, if you like to multiply something by 4 per example you need only shift the number left by 2, if you like to multiply by 8 you shl number by 3.
So, if walking to left is multiply, walking to right should be divide. So, if you like to divide something by 4 per example you need only shift the number right by 2, if you like to divide by 8 you shr number by 3. This way you get quotient.
But by dividing (shift right) we lost remainder, so, to you get the remainder you can use a mask on original number.
mov eax, ebx ;ebx holds a number to be divided by 4
mov edx,ebx
shr eax,2 ;quotient, divide by 4
and edx,3 ;remainder, 3 means mask, 00000011b
You're working with sign numbers, so, if you get a negative value you can transform it to positive value by using "neg" instruction.
.IF tChar == "@" ;64; .if tChar == "@@@@"
; mov eax, offset Ln1
.while eax != 0
mov dword ptr [edi],"@@@@"
add edi,4 ;lea edi,[edi+4]
dec eax
.endw
.while edx != 0
mov byte ptr [edi],"@"
inc edi
dec edx
.endw
.ELSE;IF tChar == " " ;32
; mov eax, offset Ln2
.while eax != 0
mov dword ptr [edi]," "
add edi,4 ;lea edi,[edi+4]
dec eax
.endw
.while edx != 0
mov byte ptr [edi]," "
inc edi
dec edx
.endw
.ENDIF
Master of the Golden Lands of Minas Gerais, Mr. Mineiro!!
Thanks my friend.
I´m having a lot of fun only dealing with asm without having to deal with Windows....
I´m prototyping these algos and making them to work first. Then, I´ll be tweaking and fiddling with it so to optimize it for speed.
I was having issues to time my algos, that´s why I wrote this thread. Without timing it properly I could not possibly judge anything. JJ has given me the timing tip above and now it´s fun time.
"NEG" => I have forgot about this one!!!! Thanks, I was wasting some valuable cycles there
SHR/SHL I´ve read somewhere these are kind of slower instructions. I guess there must be a cost relation and my extra code is in fact slower. I´ll check them
Please, keep it coming!
Cheers
Alex
Hi Alex,
You should isolate the file load times from your algos. See the bookmarks in the attachment for an example.
In your case, it doesn't matter a lot because the algos have a lot of work to do, but for fast short algos it makes a big difference if you are just working on a buffer, or if the buffer needs to be filled from disk before the short work starts.
(in conditional assembly, ife means "if equal" or "if zero" - see free esi)
Dear JJ,
I´ll check this out now!
I must thank you, you can´t imagine how useful your timing snippet above is being right now! I was missing that.
I´ve already started to fiddle with the algo:
- mUNROLLdword macro was slower so I got back mUNROLL and am optimizing it.
- NEG was clearly faster than my maths
- SHR/SHL made a negative effect on the code, slower
- Working on optimizing reg usage, with good results
So far:
Quote
New Algo: 103 ms Old Algo: 97 ms
New Algo: 102 ms Old Algo: 96 ms
New Algo: 102 ms Old Algo: 97 ms
New Algo: 105 ms Old Algo: 97 ms
New Algo: 108 ms Old Algo: 100 ms
New Algo: 109 ms Old Algo: 101 ms
after using "NEG"
.mUNROLLdword was slower, using mUNROLL
.SHR/SHL was slower
New Algo: 98 ms Old Algo: 103 ms
New Algo: 98 ms Old Algo: 102 ms
New Algo: 100 ms Old Algo: 103 ms
New Algo: 98 ms Old Algo: 102 ms
New Algo: 99 ms Old Algo: 100 ms
New Algo: 99 ms Old Algo: 101 ms
after optm tChar for ecx mUNROLL2
New Algo: 96 ms Old Algo: 99 ms
New Algo: 94 ms Old Algo: 104 ms
New Algo: 99 ms Old Algo: 97 ms x
New Algo: 92 ms Old Algo: 98 ms
New Algo: 94 ms Old Algo: 97 ms
New Algo: 92 ms Old Algo: 100 ms
New Algo: 96 ms Old Algo: 97 ms
hello sir LordAdef;
I agree with you, some persons use 'lea' instead of 'shl'.
Check this link
http://www.agner.org/optimize/
Quote from: mineiro on March 26, 2017, 08:44:03 PMsome persons use 'lea' instead of 'shl'
Let's time it :biggrin:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
10 cycles for 100 * shl eax, 1
25 cycles for 100 * lea eax, [2*eax]
10 cycles for 100 * shl eax, 3
17 cycles for 100 * lea eax, [8*eax]
18 cycles for 100 * lea eax, [8*eax+100]
148 cycles for 100 * imul eax, eax, 8
11 cycles for 100 * shl eax, 1
26 cycles for 100 * lea eax, [2*eax]
11 cycles for 100 * shl eax, 3
16 cycles for 100 * lea eax, [8*eax]
17 cycles for 100 * lea eax, [8*eax+100]
147 cycles for 100 * imul eax, eax, 8
10 cycles for 100 * shl eax, 1
25 cycles for 100 * lea eax, [2*eax]
10 cycles for 100 * shl eax, 3
18 cycles for 100 * lea eax, [8*eax]
18 cycles for 100 * lea eax, [8*eax+100]
149 cycles for 100 * imul eax, eax, 8
11 cycles for 100 * shl eax, 1
25 cycles for 100 * lea eax, [2*eax]
10 cycles for 100 * shl eax, 3
16 cycles for 100 * lea eax, [8*eax]
18 cycles for 100 * lea eax, [8*eax+100]
148 cycles for 100 * imul eax, eax, 8
2 bytes for shl eax, 1
7 bytes for lea eax, [2*eax]
3 bytes for shl eax, 3
7 bytes for lea eax, [8*eax]
7 bytes for lea eax, [8*eax+100]
3 bytes for imul eax, eax, 8
these are results on my pc:
Intel(R) Pentium(R) Dual CPU E2160 @ 1.80GHz (SSE4)
12 cycles for 100 * shl eax, 1
1 cycles for 100 * lea eax, [2*eax]
12 cycles for 100 * shl eax, 3
1 cycles for 100 * lea eax, [8*eax]
0 cycles for 100 * lea eax, [8*eax+100]
180 cycles for 100 * imul eax, eax, 8
87 cycles for 100 * shl eax, 1
1 cycles for 100 * lea eax, [2*eax]
12 cycles for 100 * shl eax, 3
1 cycles for 100 * lea eax, [8*eax]
0 cycles for 100 * lea eax, [8*eax+100]
181 cycles for 100 * imul eax, eax, 8
12 cycles for 100 * shl eax, 1
1 cycles for 100 * lea eax, [2*eax]
12 cycles for 100 * shl eax, 3
1 cycles for 100 * lea eax, [8*eax]
0 cycles for 100 * lea eax, [8*eax+100]
181 cycles for 100 * imul eax, eax, 8
12 cycles for 100 * shl eax, 1
1 cycles for 100 * lea eax, [2*eax]
12 cycles for 100 * shl eax, 3
1 cycles for 100 * lea eax, [8*eax]
0 cycles for 100 * lea eax, [8*eax+100]
180 cycles for 100 * imul eax, eax, 8
2 bytes for shl eax, 1
7 bytes for lea eax, [2*eax]
3 bytes for shl eax, 3
7 bytes for lea eax, [8*eax]
7 bytes for lea eax, [8*eax+100]
3 bytes for imul eax, eax, 8
--- o
0 cycles for 100 * lea is very strange ::)
these results are frustrating sometimes.....
sadly I left my pc already. I post my results later today
AMD E1-6010 APU with AMD Radeon R2 Graphics (SSE4)
0 cycles for 100 * shl eax, 1
3 cycles for 100 * lea eax, [2*eax]
0 cycles for 100 * shl eax, 3
3 cycles for 100 * lea eax, [8*eax]
2 cycles for 100 * lea eax, [8*eax+100]
173 cycles for 100 * imul eax, eax, 8
1 cycles for 100 * shl eax, 1
4 cycles for 100 * lea eax, [2*eax]
0 cycles for 100 * shl eax, 3
3 cycles for 100 * lea eax, [8*eax]
2 cycles for 100 * lea eax, [8*eax+100]
174 cycles for 100 * imul eax, eax, 8
0 cycles for 100 * shl eax, 1
3 cycles for 100 * lea eax, [2*eax]
0 cycles for 100 * shl eax, 3
4 cycles for 100 * lea eax, [8*eax]
2 cycles for 100 * lea eax, [8*eax+100]
173 cycles for 100 * imul eax, eax, 8
0 cycles for 100 * shl eax, 1
5 cycles for 100 * lea eax, [2*eax]
0 cycles for 100 * shl eax, 3
2 cycles for 100 * lea eax, [8*eax]
2 cycles for 100 * lea eax, [8*eax+100]
172 cycles for 100 * imul eax, eax, 8
2 bytes for shl eax, 1
7 bytes for lea eax, [2*eax]
3 bytes for shl eax, 3
7 bytes for lea eax, [8*eax]
7 bytes for lea eax, [8*eax+100]
3 bytes for imul eax, eax, 8
--- ok ---
I tested on linux, I will just reboot and post results on windows.
I'll be back and edit this message.
Intel(R) Pentium(R) Dual CPU E2160 @ 1.80GHz (SSE4)
12 cycles for 100 * shl eax, 1
1 cycles for 100 * lea eax, [2*eax]
12 cycles for 100 * shl eax, 3
1 cycles for 100 * lea eax, [8*eax]
0 cycles for 100 * lea eax, [8*eax+100]
180 cycles for 100 * imul eax, eax, 8
12 cycles for 100 * shl eax, 1
1 cycles for 100 * lea eax, [2*eax]
12 cycles for 100 * shl eax, 3
1 cycles for 100 * lea eax, [8*eax]
0 cycles for 100 * lea eax, [8*eax+100]
180 cycles for 100 * imul eax, eax, 8
12 cycles for 100 * shl eax, 1
1 cycles for 100 * lea eax, [2*eax]
12 cycles for 100 * shl eax, 3
2 cycles for 100 * lea eax, [8*eax]
0 cycles for 100 * lea eax, [8*eax+100]
180 cycles for 100 * imul eax, eax, 8
12 cycles for 100 * shl eax, 1
1 cycles for 100 * lea eax, [2*eax]
12 cycles for 100 * shl eax, 3
1 cycles for 100 * lea eax, [8*eax]
0 cycles for 100 * lea eax, [8*eax+100]
180 cycles for 100 * imul eax, eax, 8
2 bytes for shl eax, 1
7 bytes for lea eax, [2*eax]
3 bytes for shl eax, 3
7 bytes for lea eax, [8*eax]
7 bytes for lea eax, [8*eax+100]
3 bytes for imul eax, eax, 8
--- ok ---
Same results on linux or windows.
---edited---
Other results
Clock Core cyc Instruct Uops BrMispred
;-------------------------------------------------------
TestA call shl eax,1
4131 4126 9000 11000 0
4131 4126 9000 11000 0
9 5 9 11 0
1000*shl eax,1
990 996 1000 1000 0
100000*shl eax,1
101151 101151 100000 106249 0
;-------------------------------------------------------
TestB call lea eax,[eax*2]
7695 5126 9000 11000 0
7686 5126 9000 11000 0
9 5 9 11 0
1000*lea eax,[eax*2]
999 995 1000 1000 0
100000*
105813 105811 100000 112499 0
;-------------------------------------------------------
TestC call shl eax,3
4248 4253 9000 11000 0
6381 4253 9000 11000 0
9 6 9 11 0
1000*shl eax,3
999 999 1000 1000 0
100000*
152973 101982 100000 109375 0
;-------------------------------------------------------
TestD call lea eax,[eax*8]
5256 5252 9000 11000 0
5256 5252 9000 11000 0
9 6 9 11 0
1000*lea eax,[eax*8]
999 996 1000 1000 0
100000*
307638 307648 100000 125198 0
;-------------------------------------------------------
TestE call lea eax,[eax*8+100]
5130 5126 9000 11000 0
7686 5126 9000 11000 0
0 5 9 11 0
1000*lea eax,[eax*8+100]
999 996 1000 1000 0
100000*
322011 214676 100000 125189 0
;-------------------------------------------------------
TestF call imul eax,eax,8
4257 4253 9000 11000 0
6372 4253 9000 11000 0
9 6 9 11 0
1000*imul eax,eax,8
2997 2995 1000 1000 0
100000*
450054 300036 100000 109375 0
;--------------------------------------------------------
Thanks to Mineiro´s tips, I managed to squeeze some good ms from the algo.
I took his dword example above but changed with a little trick I think saved me some time. Instead of doing the dword copy loop and then the byte to byte for the remainders, I rolled the dword loop 1 step beyond my target. Afterwards, I set my ptr index back. Although I will be rewriting 1 or 2 bytes, I got rid of a lot of code:
Quote
mUNROLL3 MACRO ;NEW
; copy dword chunks + 1 dword
; edi is brought back overwritting the extra bytes
xor eax, eax
.IF ecx == 64
@@:
cmp eax, ebx
ja @F ; it goes beyond ebx
mov dword ptr [edi+eax],"@@@@"
add eax, 4
jmp @B
@@:
add edi, ebx
The history of my timings step by step (resumed, full list in the zip):
Quote
New Algo: 103 ms Old Algo: 97 ms
New Algo: 102 ms Old Algo: 96 ms
New Algo: 102 ms Old Algo: 97 ms
after using "NEG":
.mUNROLLdword was slower, using mUNROLL
.SHR/SHL was slower
New Algo: 98 ms Old Algo: 103 ms
New Algo: 98 ms Old Algo: 102 ms
New Algo: 98 ms Old Algo: 102 ms
after optm tChar for ecx mUNROLL2:
New Algo: 96 ms Old Algo: 99 ms
New Algo: 94 ms Old Algo: 104 ms
New Algo: 92 ms Old Algo: 98 ms
after exchanging tChar for ecx:
New Algo: 91 ms Old Algo: 97 ms
New Algo: 93 ms Old Algo:134 ms
New Algo: 93 ms Old Algo: 97 ms
after xor tChar, 01100000b
New Algo: 91 ms Old Algo: 98 ms
New Algo: 90 ms Old Algo: 97 ms
New Algo: 91 ms Old Algo: 98 ms
after using the adapted Mineiro´s dword copy code
.got rid of tChar local. Using ECX
New Algo: 65 ms Old Algo: 102 ms
New Algo: 65 ms Old Algo: 98 ms
New Algo: 66 ms Old Algo: 102 ms
after making LineLenght an EQU
New Algo: 65 ms Old Algo: 96 ms
New Algo: 64 ms Old Algo: 96 ms
New Algo: 64 ms Old Algo: 97 ms
I left old code commented for comparison reasons.
Do you guys see any place I can still improve within this algo design?
What about this: mov edx, 12-1 ;counter
xor ebx, ebx
bak:
inc edx
@Main:
cmp edx, tLen
jz DoneDC ; DONE?
mov bl, SBYTE Ptr [esi+edx]
test bl, bl
js Negative
jz Zero
; ------------------------- It´s Positive:
add tLineLen, 1 ; inc line num
add tSum, ebx ; Sums line (to recouver missing chunk)
mUNROLL3 ;x ; MUST HAVE ecx == 64 or 32
xor ecx, 01100000b ;x
jmp bak
; bak:
; add edx, 1
; jmp @Main
hello sir LordAdef;
Ecx holds value to be stored
mUNROLL3 MACRO ;NEW
; copy dword chunks + 1 dword
; edi is brought back overwritting the extra bytes
xor eax, eax
; .IF ecx == "@@@@" ;64
@@:
cmp eax, ebx
ja @F ; it goes beyond ebx
mov dword ptr [edi+eax],ecx ;"@@@@"
add eax, 4
jmp @B
@@:
add edi, ebx
; .ELSE
; @@:
; cmp eax, ebx ; lenght of line
; ja @F
; mov dword ptr [edi+eax]," "
; add eax, 4
; jmp @B
; @@:
; add edi, ebx
; .ENDIF
endm
You now put "@" or space into a register, so why not expand that to "@@@@" or " "?
pDecompressNEW proc, string:DWORD
LOCAL tLine:DWORD
LOCAL tLineLen :DWORD
LOCAL tSum:DWORD
LOCAL tLen :DWORD
LOCAL tChar:DWORD ; 64@ or 32" " ; NOT USING TCHAR -> ECX
mov tLine, 0
mov tLineLen, 0
mov tSum, 0
LineLenght = 160 ; Lenght of line is fixed to 160
mov edi, offset Decomp
mov esi, InputFile(string)
mov tLen, ecx
mov ecx,"@@@@" ;<-------------
mov edx, 12
xor ebx, ebx
@Main:
cmp edx, tLen
jz DoneDC
mov bl, SBYTE Ptr [esi+edx] ;carrego um byte que contém um contador
test bl, bl ;se for zero ou negativo
js Negative
jz Zero
; ------------------------- ItŽs Positive:
add tLineLen, 1
add tSum, ebx
mUNROLL3
xor ecx, 01100000011000000110000001100000b ;<------------ permute 4 @@@@ to 4 spaces or vice-versa
bak:
add edx, 1
jmp @Main
Negative:
add tLineLen, 1
neg bl
mov ecx, " " ;<-----------------
add tSum, ebx
mUNROLL3
; ----------- Missing chunk -----------
mov ebx, LineLenght ;x
xor ecx, 01100000011000000110000001100000b ;<------------ permute 4 @@@@ to 4 spaces or vice-versa
sub ebx, tSum ; Missing chunk value
mUNROLL3 ;x ; copy macro, byte per byte
mCHECKnext ; ecx is still the val of tChar. "mov tChar, ecx" here at this macro
add tLine, 1
jmp bak
Zero: ; Copy last line from edi itself. Uses MemCopy, dword.
mov eax, edi
sub eax, LineLenght
invoke MemCopy, eax, edi, LineLenght
add edi, LineLenght
mCHECKnext ;check if next val is zero -> makes LineLen & Sum = 0
add tLine, 1
mov ecx, "@@@@" ;<----------
jmp bak
DoneDC:
free esi
xor edi, edi ;edi is index
ret
pDecompressNEW endp
hey Mineiro,
you are right, it's so obvious and still it didn't click me. in fact, with that approach, I can even loose that conditional .IF ecx== since ecx already holds the value we need. We just need to copy ecx straight away.
I can't wait to time this and see if that saves 1 or 2 extra ms.
JJ idea above (jmp bak) is also very good! it saves 1 jump for every value. Clean and sutil.
Quoteafter making LineLenght an EQU
New Algo: 65 ms Old Algo: 96 ms
New Algo: 65 ms Old Algo: 98 ms
New Algo: 65 ms Old Algo: 96 ms
New Algo: 64 ms Old Algo: 96 ms
New Algo: 64 ms Old Algo: 97 ms
after ecx = "@@@@", and bak loop adjusted
New Algo: 63 ms Old Algo: 99 ms
New Algo: 62 ms Old Algo: 100 ms
New Algo: 61 ms Old Algo: 99 ms
New Algo: 62 ms Old Algo: 98 ms
New Algo: 62 ms Old Algo: 99 ms
New Algo: 61 ms Old Algo: 99 ms
New Algo: 61 ms Old Algo: 98 ms
New Algo: 61 ms Old Algo: 99 ms
The latest moves gave it a boost of about 2-3 ms. Nice!
I think that if you describe what your code should do we can help more.
I don't get a clear idea about what your code is doing, well, what negative,positive or zero piece codes should do.
If you describe so I can create a code and we can measure this function.
---edited----
something like, you're dealing with a map that have 2 chars (@ or space), so, you can fullfill a map with most used char, I'm supposing "@",so, the code only need draw(overwrite) river (space).
Have a lot of approaches to be tried.
Sorry Mineiro,
This is related to the game I am developing in the other thread. this algo reads back my RLE map backgrounf file. Since you posted there I thought you were in the loop:
http://masm32.com/board/index.php?topic=5962.msg64373#msg64373 (http://masm32.com/board/index.php?topic=5962.msg64373#msg64373)
This is what it is, with an example. the game is a River Raid recreation.
oh yes, my fault, I'm reading that code now.
Hello sir LordAdef;
Is this description right or wrong?
compressed file.dat holds a structure with 3 dword members (original size, line count and line lenght) followed by byte counters
char = @ || space
while byte counters != end of file do
if counter < 0, neg counter,count sum, store N chars, permute char, store char (160-count sum) times
if counter > 0, store N chars, permute char, count sum
if counter = 0, copy last 160 uncompressed chars (repeat last line)
Hi Mineiro,
That´s the idea.
Visually, it goes like this:
@@@@@@@@@@ @@@ @@@@@@@@@@ -> this is a line. Now, my map is set to 160 chars
10 5 3 5 x
when compressing, I loose the last chunk (mark as x) since I can put it back on decomp. I neg the chunk before the last, to mark it as my "end of line".
So, This line will be compressed like this: "10, 5, 3, -5"
Similar lines are mark as "0"
Like:
@@@@@@@@@@ @@@ @@@@@@@@@@
@@@@@@@@@@ @@@ @@@@@@@@@@
@@@@@@@@@@ @@@ @@@@@@@@@@
= 10, 5, 3, -5, 0, 0
you got it!
Abraços!
Alex