Good day,
This is my first post, and I'm not a native english speaker, so please excuse possible mistakes of mine. For introduction - I'm not a great low level programmer, but I love MASM and I'm using it for at least 10 years and wrote many windows applications including commercial ones.
Recently, I've needed a combination generator, so I wrote one in MASM. Speed was not very important, and job was done. But after job was done, I became interested in making this simple thing faster, just because I have some passion for perfection.
Idea is that it gives back combinations from given alphabet, for example "AAA","AAB","AAC" and so on.
I've put some effort and soon result became around 7 cycles per combination (maybe can be a liitle faster without function call). Here is code:
.686
.model flat,stdcall
option casemap:none
include \masm32\include\windows.inc
include \masm32\include\user32.inc
include \masm32\include\kernel32.inc
includelib \masm32\lib\user32.lib
includelib \masm32\lib\kernel32.lib
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
NextCombination PROTO
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
.data
ALPHABET db "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
align 4
ALPHABET_SIZE dd 62
align 4
COMB_LEN dd 4 ;Actual combination length - 1
align 4
COMB db 0,0,0,0,-1 ;Initial alphabet indexes
align 4
SZCOMB db "00000",0 ;Initial zero terminated string
; -------------------------------------------------------------------------
format db "%u",0
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
.data?
StartTime dd ?
buffer db 64 dup(?)
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
.code
start:
mov ebx, 0FFFFFFFFh ;CPU preheat ;) Helps to settle down throttling
@@:
sub ebx, 1
jnz @B
; -------------------------------------------------------------------------
invoke GetTickCount
mov StartTime, eax
@@:
invoke NextCombination ;eax = pointer to SZCOMB or 0 if no more combinations
test eax, eax
jnz @B
invoke GetTickCount
sub eax, StartTime
invoke wsprintf, addr buffer, addr format, eax
invoke MessageBox, 0, addr buffer, 0, MB_OK
invoke ExitProcess, 0
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
align 4
NextCombination proc
mov ecx, COMB_LEN
mov eax, offset SZCOMB
;----------------------------
@@:
add byte ptr[COMB+ecx], 1
movzx edx, byte ptr[COMB+ecx]
cmp edx, ALPHABET_SIZE
mov dl, [ALPHABET+edx]
mov [eax+ecx], dl
jl Exit
test ecx, ecx
jz NoMore
;----------------------------
and byte ptr[COMB+ecx], ch
mov dl, [ALPHABET]
mov [eax+ecx], dl
sub ecx, 1
jmp @B
;----------------------------
NoMore:
xor eax, eax
;----------------------------
Exit:
ret
NextCombination endp
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start
I was quite happy with my optimizations (it became reasonably quick), until I found this link:
https://codereview.stackexchange.com/questions/38474/brute-force-algorithm-in-c (https://codereview.stackexchange.com/questions/38474/brute-force-algorithm-in-c)
Here someone wrote same thing in C using a bit different algorythm and got 0.4 seconds for 5 character length combinations (alphabet size 62). So it is reasonable to say, that his code achieves 1-2 cycles per combination on average. My code on a 2.9Ghz boosting Core M 5Y71 CPU takes around 2.4 seconds.
I have not tried to write that algo in MASM, I desided to ask opinion from many experienced programmers here. Can it really be so much faster? Some suggestions to do even better?
P.S. even one increment of memory location takes 1.9 seconds in a loop of 62^5 iterations on my PC.
mov esi, offset COMB
mov ebx, 916132832 ;62^5
@@:
add byte ptr[esi], 1
sub ebx, 1
jnz @B
If someone is interested, I will be happy to receive comments. Hope this subject or even my code can be useful to someone.
Thank you all.
Hi Vincent,
Welcome to the Forum :icon14:
Your algo looks OK, I had not much time to test anything but I could imagine that you could speed it up a bit by moving one of the cmp+jmp combinations to an outer loop:NextCombination proc
mov ecx, COMB_LEN
mov eax, offset SZCOMB
@@:
add byte ptr[COMB+ecx], 1
movzx edx, byte ptr[COMB+ecx]
cmp edx, ALPHABET_SIZE ; 1st cmp
mov dl, [ALPHABET+edx]
mov [eax+ecx], dl
jl Exit
test ecx, ecx ; 2nd cmp
jz NoMore
and byte ptr[COMB+ecx], ch
mov dl, [ALPHABET]
mov [eax+ecx], dl
sub ecx, 1
jmp @B
NoMore:
xor eax, eax
Exit:
ret
NextCombination endp
For example, the
sub ecx, 1
jmp @B
could be replaced by
dec ecx
jns @B
and would eliminate the need for the test ecx, ecx.
Welcome! You can align the data (all of them not just the dwords). But in order i.e. firts the bytes and then the dwords. Don't call your function, just execute it in the main procedure. Also i was told than aligning the procedures can have a negative performance impact but i don't remember why. Maybe you should try OR ebx,0ffffffffh as the first instruction instead of the move. Or using the move with eax instead of ebx. Btw is that loop really necessary?
Probably there are a lot of things you can try. Good luck! :biggrin:
I think you should do a macro instead of invoke where you could it reduces call and return opcodes and and work in the buffer all the time until its finished you output to a windows textfield instead of all printing
if cpu's are multicore, create a worker thread to do half your work simultanously on second core
The tests are not equivalent.
The generate() function has much less iteration loops than the number of calls to your NextCombination proc. Make counters to confirm. Although NextCombination could be inlined to speed up a bit, the real problem is not that.
We are comparing different algorithms, or tomatoes with oranges, nothing to do with MASM.
Thanks to all for replies. I will play around a bit more and try out suggestions. Interestingly, I remember I tried to take code out of procedure, but it made no difference. Maybe instruction pairing cancels call/retn overhead in this code? But yes, macro can be really nice here :) Runs fast, but still looks nice :)
Anyway, I think it is true, that choosing the best algo is the most important in any optimization. It can make dramatic difference. And I became confused when I saw an example in C. Ever it is mistake, or some magic I dont understand. So if it is not a mistake, it would be interesting to get opinions why is it so much faster?
Say it is 5 characters length. Say alphabet is 62 symbols. He makes an array of 62^2 patterns with precalculated 2 last positions. Then increments 3 first positions, so 62^3 increments. Looks nice, BUT, if I understand it correctly, he is writing incremented symbol to all patterns in an array using loop, so it is 62^2 writes to that array for each increment. How all this makes it so much faster is beyond my understanding at this moment. Around 6 times (!) faster then my code.
On my PC this takes 328ms
mov ebx, 916132832 ;62^5
@@:
sub ebx, 1
jnz @B
Anything more takes longer. Especialy if it is a write to memory. Then it jums to 2000ms and more for this loop. But how to iterate less then 62^5 or do no memory writes I dont understand. Writer of C code claims 400ms for total running time producing 62^5 combinations.
As I said before, if someone is interested to analyze this situation :)
I must admit I have not fully understood the logic and the expected output. An example might be helpful.
If it is just creating sequences like
AAAA
BBBB
CCCC
..
ZZZZ
then packed SIMD instructions could give you a huge speed increase.
I would check by myself instead of speculating on something I did not even try.
You will be surprised how wrong you are on your assumptions. I am assuming you know how to compile a program in C.
jj2007, here is example.
1. Say alphabet is "0123456789" (alphabet size here is 10)
2. Say length of combination pattern is 3
Produce all possible paterns as zero terminated strings like this:
000, 001, 002, ..., 009
then,
010, 011, 012, ..., 019
until
990, 991, 992, ..., 999 and finished.
Alphabet = any set of characters (it can be not contiguous in ASCII table, for example "0123456789ABCDF" "A" is not after "9" in ASCII table)
Combination length = any reasonable length
Hope this a good example.
aw27, I'm really sorry that my questions are based on speculation. Yes, I know C reasonably well and I can get compiler and compile. Maybe I was just expecting someone to put theoretical light on doing such algo, or maybe say that "I did such algo years ago and yes, you can do with precalculation a lot more efficient and thats because of <something>". If no, then it is ok, I can always work on my own.
The problem is that you are unable to figure something evident just by looking at the code, you are not willing to compile the C program, and you are no able to figure out that its is simply impossible a program to print 916132832 times to the screen in 400 miliseconds when your program prints only once and takes a few times more. God bless you! :t
What about creating 4 at a time:
abcd
bcde
...
wxyz
Could be done with an add eax, 01010101h, plus some polishing for aaaa, zzzz etc.
Unfortunately I won't have time to test or design anything, so it's your turn ;-)
aw27, this is some serious miscommunication. God bless communication ;)