News:

Masm32 SDK description, downloads and other helpful links
Message to All Guests
NB: Posting URL's See here: Posted URL Change

Main Menu

Generate all combinations from given alphabet

Started by Vincent, January 20, 2018, 09:36:06 AM

Previous topic - Next topic

Vincent

Good day,

This is my first post, and I'm not a native english speaker, so please excuse possible mistakes of mine. For introduction - I'm not a great low level programmer, but I love MASM and I'm using it for at least 10 years and wrote many windows applications including commercial ones.

Recently, I've needed a combination generator, so I wrote one in MASM. Speed was not very important, and job was done. But after job was done, I became interested in making this simple thing faster, just because I have some passion for perfection.

Idea is that it gives back combinations from given alphabet, for example "AAA","AAB","AAC" and so on.

I've put some effort and soon result became around 7 cycles per combination (maybe can be a liitle faster without function call). Here is code:

.686
.model flat,stdcall
option casemap:none

include \masm32\include\windows.inc
include \masm32\include\user32.inc
include \masm32\include\kernel32.inc

includelib \masm32\lib\user32.lib
includelib \masm32\lib\kernel32.lib

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

NextCombination PROTO

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

.data

ALPHABET        db "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
align 4
ALPHABET_SIZE   dd 62
align 4
COMB_LEN        dd 4            ;Actual combination length - 1
align 4
COMB            db 0,0,0,0,-1   ;Initial alphabet indexes
align 4
SZCOMB          db "00000",0    ;Initial zero terminated string
; -------------------------------------------------------------------------
format          db "%u",0

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

.data?

StartTime dd ?
buffer    db 64 dup(?)

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

.code
start:

mov ebx, 0FFFFFFFFh     ;CPU preheat ;) Helps to settle down throttling
@@:
sub ebx, 1
jnz @B
; -------------------------------------------------------------------------
invoke GetTickCount
mov StartTime, eax
@@:
invoke NextCombination   ;eax = pointer to SZCOMB or 0 if no more combinations
test eax, eax
jnz @B
invoke GetTickCount
sub eax, StartTime
invoke wsprintf, addr buffer, addr format, eax
invoke MessageBox, 0, addr buffer, 0, MB_OK
invoke ExitProcess, 0

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

align 4
NextCombination proc

mov ecx, COMB_LEN
mov eax, offset SZCOMB
;----------------------------
@@:
add byte ptr[COMB+ecx], 1
movzx edx, byte ptr[COMB+ecx]
cmp edx, ALPHABET_SIZE
mov dl, [ALPHABET+edx]
mov [eax+ecx], dl
jl Exit
test ecx, ecx
jz NoMore
;----------------------------
and byte ptr[COMB+ecx], ch
mov dl, [ALPHABET]
mov [eax+ecx], dl
sub ecx, 1
jmp @B
;----------------------------
NoMore:
xor eax, eax
;----------------------------
Exit:

ret
NextCombination endp

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

end start


I was quite happy with my optimizations (it became reasonably quick), until I found this link:

https://codereview.stackexchange.com/questions/38474/brute-force-algorithm-in-c

Here someone wrote same thing in C using a bit different algorythm and got 0.4 seconds for 5 character length combinations (alphabet size 62). So it is reasonable to say, that his code achieves 1-2 cycles per combination on average. My code on a 2.9Ghz boosting Core M 5Y71 CPU takes around 2.4 seconds.

I have not tried to write that algo in MASM, I desided to ask opinion from many experienced programmers here. Can it really be so much faster? Some suggestions to do even better?

P.S. even one increment of memory location takes 1.9 seconds in a loop of 62^5 iterations on my PC.

mov esi, offset COMB
mov ebx, 916132832  ;62^5
@@:
add byte ptr[esi], 1
sub ebx, 1
jnz @B


If someone is interested, I will be happy to receive comments. Hope this subject or even my code can be useful to someone.

Thank you all.

jj2007

Hi Vincent,

Welcome to the Forum :icon14:

Your algo looks OK, I had not much time to test anything but I could imagine that you could speed it up a bit by moving one of the cmp+jmp combinations to an outer loop:NextCombination proc
  mov ecx, COMB_LEN
  mov eax, offset SZCOMB

  @@:
add byte ptr[COMB+ecx], 1
movzx edx, byte ptr[COMB+ecx]
cmp edx, ALPHABET_SIZE ; 1st cmp
mov dl, [ALPHABET+edx]
mov [eax+ecx], dl
jl Exit
test ecx, ecx ; 2nd cmp
jz NoMore

and byte ptr[COMB+ecx], ch
mov dl, [ALPHABET]
mov [eax+ecx], dl
sub ecx, 1
jmp @B

  NoMore:
  xor eax, eax

Exit:
  ret
NextCombination endp


For example, the
   sub ecx, 1
   jmp @B

could be replaced by
   dec ecx
   jns @B

and would eliminate the need for the test ecx, ecx.

felipe

Welcome! You can align the data (all of them not just the dwords). But in order i.e. firts the bytes and then the dwords.  Don't call your function, just execute it in the main procedure. Also i was told than aligning the procedures can have a negative performance impact but i don't remember why. Maybe you should try OR ebx,0ffffffffh as the first instruction instead of the move. Or using the move with eax instead of ebx. Btw is that loop really necessary?
Probably there are a lot of things you can try. Good luck! :biggrin:

daydreamer

I think you should do a macro instead of invoke where you could it reduces call and return opcodes and and work in the buffer all the time until its finished you output to a windows textfield instead of all printing
if cpu's are multicore, create a worker thread to do half your work simultanously on second core
my none asm creations
https://masm32.com/board/index.php?topic=6937.msg74303#msg74303
I am an Invoker
"An Invoker is a mage who specializes in the manipulation of raw and elemental energies."
Like SIMD coding

aw27

#4
The tests are not equivalent.
The generate() function has much less iteration loops than the number of calls to your NextCombination proc. Make counters to confirm. Although NextCombination  could be inlined to speed up a bit, the real problem is not that.
We are comparing different algorithms, or tomatoes with oranges, nothing to do with MASM.

Vincent

Thanks to all for replies. I will play around a bit more and try out suggestions. Interestingly, I remember I tried to take code out of procedure, but it made no difference. Maybe instruction pairing cancels call/retn overhead in this code? But yes, macro can be really nice here :) Runs fast, but still looks nice :)

Anyway, I think it is true, that choosing the best algo is the most important in any optimization. It can make dramatic difference. And I became confused when I saw an example in C. Ever it is mistake, or some magic I dont understand. So if it is not a mistake, it would be interesting to get opinions why is it so much faster?

Say it is 5 characters length. Say alphabet is 62 symbols. He makes an array of 62^2 patterns with precalculated 2 last positions. Then increments 3 first positions, so 62^3 increments. Looks nice, BUT, if I understand it correctly, he is writing incremented symbol to all patterns in an array using loop, so it is 62^2 writes to that array for each increment. How all this makes it so much faster is beyond my understanding at this moment. Around 6 times (!) faster then my code.

On my PC this takes 328ms

mov ebx, 916132832  ;62^5
@@:
sub ebx, 1
jnz @B


Anything more takes longer. Especialy if it is a write to memory. Then it jums to 2000ms and more for this loop. But how to iterate less then 62^5 or do no memory writes I dont understand. Writer of C code claims 400ms for total running time producing 62^5 combinations.

As I said before, if someone is interested to analyze this situation :)

jj2007

I must admit I have not fully understood the logic and the expected output. An example might be helpful.

If it is just creating sequences like
AAAA
BBBB
CCCC
..
ZZZZ
then packed SIMD instructions could give you a huge speed increase.

aw27

I would check by myself instead of speculating on something I did not even try.
You will be surprised how wrong you are on your assumptions. I am assuming you know how to compile a program in C.

Vincent

jj2007, here is example.

1. Say alphabet is "0123456789" (alphabet size here is 10)
2. Say length of combination pattern is 3

Produce all possible paterns as zero terminated strings like this:

000, 001, 002, ..., 009

then,

010, 011, 012, ..., 019

until

990, 991, 992, ..., 999 and finished.

Alphabet = any set of characters (it can be not contiguous in ASCII table, for example "0123456789ABCDF" "A" is not after "9" in ASCII table)
Combination length = any reasonable length

Hope this a good example.

Vincent

aw27, I'm really sorry that my questions are based on speculation. Yes, I know C reasonably well and I can get compiler and compile. Maybe I was just expecting someone to put theoretical light on doing such algo, or maybe say that "I did such algo years ago and yes, you can do with precalculation a lot more efficient and thats because of <something>". If no, then it is ok, I can always work on my own.

aw27

The problem is that you are unable to figure something evident just by looking at the code, you are not willing to compile the C program, and you are no able to figure out that its is simply impossible a program to print 916132832 times to the screen in 400 miliseconds when your program prints only once and takes a few times more. God bless you!  :t

jj2007

What about creating 4 at a time:
abcd
bcde
...
wxyz

Could be done with an add eax, 01010101h, plus some polishing for aaaa, zzzz etc.
Unfortunately I won't have time to test or design anything, so it's your turn ;-)

Vincent

aw27, this is some serious miscommunication. God bless communication ;)