These are a set of command line parsing algorithms designed to be simple to use. There is an algorithm that counts arguments passed on the command line using the ";" character as the delimiter, plus 5 versions depending on the arg count. Later there will be a couple of array-based versions: one using a delimiter, the other using a space delimiter with quote support for arguments that have embedded spaces.
They are testing up well at the moment, and when testing is finished they are library candidates.
A few questions about arg_cnt:
mov r10, rcx
...
mov r11, rcx
Why not use RCX to begin with?
Unless there are more than 4 billion args, isn't it smaller and faster to use EAX (or even AL) instead of RAX?
Nitpick: rdx = delimiter, really DL = delimiter since RDX isn't actually used.
What about Unicode?
The calling convention will use RDX as the second arg anyway, so it's simple enough to use DL in the loop code.
> isn't it smaller and faster to use EAX (or even AL) instead of RAX?
It may be smaller but not faster. Right from the beginning with 64-bit code I have used native-sized 64-bit registers where possible to avoid mixed-size code, as I only build /LARGEADDRESSAWARE code. You are right that RCX could have been used and it would drop one instruction, but it's not in the loop code, so it's not like it matters much. These are candidates for the library, so there is some room to tweak them a little further.
It is as much habit that I start with the high registers first, from r11 down; it was to shift away from the old habits of 32-bit, where you had far fewer registers to work with.
@Steve H
/LARGEADDRESSAWARE is the default for x64 for linkers and the only effective option for x64 is /LARGEADDRESSAWARE:NO
I have heard this, but MASM is not a Microsoft C compiler; usually people set the NO option when they try to use a mnemonic that is not allowed in x64. The solution is to learn the correct codings that work in x64.
sinsi,
Here is the next version; it's a bit smaller and cleaner, but it's not like it matters much.
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
include \masm32\include64\masm64rt.inc
.code
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
entry_point proc
LOCAL sPtr :QWORD
LOCAL acnt :QWORD
sas sPtr," 0ne ; two ; three ; four ; five;six ; seven ; eight"
mov acnt, rvcall(arg_cnt,sPtr)
conout "Argument count = ",str$(acnt),lf
waitkey
.exit
entry_point endp
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
NOSTACKFRAME
arg_cnt proc
; -------------------------------
; counts ";" delimiter and adds 1
; rcx = string address
; -------------------------------
cmp BYTE PTR [rcx], 0 ; exit with error on null string
je error
xor rax, rax ; set rax to 0
sub rcx, 1
@@:
add rcx, 1
cmp BYTE PTR [rcx], 0 ; test for terminator
je @F
cmp BYTE PTR [rcx], 59 ; test if delimiter
jne @B
add rax, 1 ; increment the arg count
jmp @B
@@:
add rax, 1 ; return delimiter count + 1
ret
error:
xor rax, rax ; exit on empty string with rax = 0
ret
arg_cnt endp
STACKFRAME
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
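For anyone who wants to sanity-check the logic outside MASM, here is a minimal C sketch of the same algorithm (the function name arg_cnt_c is illustrative, not part of any library). The sample string from entry_point has 7 semicolons, so the expected count is 8:

```c
#include <stddef.h>

/* Count arguments separated by ';', mirroring arg_cnt:
   returns 0 for an empty string, otherwise delimiter count + 1. */
static size_t arg_cnt_c(const char *s)
{
    size_t count = 0;
    if (*s == '\0')
        return 0;            /* empty string -> error, return 0 */
    for (; *s != '\0'; s++) {
        if (*s == ';')
            count++;         /* count each delimiter */
    }
    return count + 1;        /* args = delimiters + 1 */
}
```

Note this counts empty fields too ("a;;b" gives 3), which matches the assembly's behaviour of simply counting delimiters.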
>but its not like it matters much
Every little bit helps...
Yeah... but x64 instructions are notoriously longer. An inc rcx is just one byte shorter than add rcx, 1. So at most the routine could only be 5 bytes shorter, including a dirty hack like cmp BYTE PTR [rcx], ah ; test for terminator ;)
Here is the extra short version, 28 bytes:
arg_cnt:
; -------------------------------
; counts ";" delimiter and adds 1
; rcx = string address
; -------------------------------
xor eax, eax ; set rax to 0
cmp BYTE PTR [rcx], al ; exit with error on null string
je error
dec rcx
@@:
inc rcx
cmp BYTE PTR [rcx], ah ; test for terminator
je @F
cmp BYTE PTR [rcx], 59 ; test if delimiter
jne @B
inc eax ; increment the arg count
jmp @B
@@:
inc eax ; return delimiter count + 1
error:
ret 0
I don't particularly lose any sleep over instruction length, because 64-bit processors run 64-bit instructions natively. Yes, you can often use a 32-bit register, but the processor still reads the full 64-bit register, so the only gain was in porting 32-bit legacy code to 64-bit.
Well, it's 41 -> 28 bytes. Speed-wise it may become a problem if you frequently use the function in loops with a million iterations. Btw, with optimisations on, 64-bit compilers use these tricks with 32-bit registers.
I was thinking more along the lines of extra bytes adding up over all procs, not calling one longer proc a million times.
If you are using RAX when a 32-bit number is all you ever use, it doubles the instruction length (I would imagine speed stays the same).
Over a few thousand lines of code it could add up to a few cache misses etc.
cmp BYTE PTR [rcx], ah
Too bad if there are more than 255 args (yes I have seen this once) :biggrin:
Quote from: hutch-- on March 10, 2019, 10:38:25 PM
I have heard this but MASM is not a Microsoft C compiler, usually people set the NO option when they try to use a mnemonic that is not allowed in x64. The solution is to learn the correct codings that work in x64.
Not a cl issue; as a command line option, /LARGEADDRESSAWARE alone is quite useless with x64. It has no effect at all, as it is the default.
Just use PEView or similar to check the IMAGE_FILE_LARGE_ADDRESS_AWARE bit after linking.
From long ago I keep hearing the assumption that shorter code, in terms of byte length, is supposed to be faster, but the clock does not agree with that assumption: it is based on pre-i486 hardware in 16-bit real mode. It vaguely mattered in MS-DOS COM files, but the i486 brought pipelines and the beginning of instruction scheduling, and instructions are still read complete, not in part. An instruction muncher does not care about instruction size as long as the processor and OS version are capable of reading that instruction. I am still fascinated that MS-DOS assumptions linger on in 32- and 64-bit coding practice.
Whatever the perceived advantages of any given algo design may be, look at the alignment padding at the end of the procedure and you have thrown most of them away most of the time. If an algorithm is so long that it is affected by cache problems, redesign it. The ultimate test is, as usual, the clock; the rest does not matter.
RE: C compiler optimisation.
It was not that long ago that optimising C compilers loaded an immediate into a register before performing an ADD. As before, the clock is what matters.
Timo, this is what Microsoft have to say about /LARGEADDRESSAWARE
Quote
The /LARGEADDRESSAWARE option tells the linker that the application can handle addresses larger than 2 gigabytes. In the 64-bit compilers, this option is enabled by default. In the 32-bit compilers, /LARGEADDRESSAWARE:NO is enabled if /LARGEADDRESSAWARE is not otherwise specified on the linker line.
If the only gain is in typing a shorter link line, I don't have a problem with it as I auto generate the batch files I use to build 64 bit MASM binaries.
I also like that my CPU is a 3.5 GHz+ number cruncher, and my favourite kind of instruction sets are the biggest ones, but I believe the newest CPUs are better designed to handle large opcodes and large data sizes, and to crunch numbers faster.
As for /LARGEADDRESSAWARE, I would really like to see a masm64 or masm32 program take advantage of that kind of memory allocation; not here, but probably in the game forum, maybe reading in compressed data and storing it uncompressed.
The issue is not the length of one instruction. If it's in the cache, it doesn't matter whether it's a one-byter like inc eax or lodsb. But if a loop doesn't fit into the instruction cache, the CPU must reload the instructions, and that obviously costs cycles. Therefore shorter instructions allow more complex loops.
https://stackoverflow.com/questions/22921373/how-to-write-instruction-cache-friendly-program-in-c
I am very much of the view that if an algorithm does not fit into the cache, it either needs to be rewritten, or it is that long by necessity and there is little you can do about it. When you use old instructions like LODSB without the REP prefix, you get a serious performance penalty which byte-level reduction will not compensate for. With loop code there is a simple technique for testing cache effects: unroll the loop until the loop timing gets slower, then back it off until it is near its fastest. Slightly less is better, as different processors respond differently to how much loop unrolling is used.
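The unroll-then-back-off technique described above can be sketched in C with a toy byte-summing loop (the names sum_bytes and sum_bytes_x4 are illustrative). The idea is to time both versions on the target CPU and reduce the unroll factor if timing gets worse:

```c
#include <stddef.h>

/* Test data for the two loops below. */
static const unsigned char data[] = { 1, 2, 3, 4, 5, 6, 7 };

/* Straightforward loop: one byte per iteration. */
static unsigned sum_bytes(const unsigned char *p, size_t n)
{
    unsigned s = 0;
    for (size_t i = 0; i < n; i++)
        s += p[i];
    return s;
}

/* The same loop unrolled by 4: fewer branch instructions per byte
   processed, at the cost of a larger loop body. Time both on the
   target CPU and back the unroll factor off if it gets slower. */
static unsigned sum_bytes_x4(const unsigned char *p, size_t n)
{
    unsigned s = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        s += p[i] + p[i + 1] + p[i + 2] + p[i + 3];
    for (; i < n; i++)          /* handle the leftover tail bytes */
        s += p[i];
    return s;
}
```

Both versions must produce identical results for every input length; only the timing differs, which is why the clock is the final arbiter.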
From the Intel manual:
In 64-bit mode, INC r16 and INC r32 are not encodable (because opcodes 40H through 47H are REX prefixes).
Otherwise, the instruction's 64-bit mode default operation size is 32 bits. Use of the REX.R prefix permits access to
additional registers (R8-R15). Use of the REX.W prefix promotes operation to 64 bits.
INC DEC versus ADD SUB. See this link for a discussion on it.
https://stackoverflow.com/questions/36510095/inc-instruction-vs-add-1-does-it-matter
I read the article you posted, but it's old stuff aimed at people writing C/C++ code; applying this C/C++ theory to direct assembler is like the cart pushing the horse. Branch reduction has been with us for a long time, and so has instruction count reduction, but the one that really matters is memory operand reduction, as you will see when timing an algorithm.
This is probably the version I will add to the library for counting arguments separated by a delimiter. An extra instruction in the loop, but only 1 memory read per iteration, and full 64-bit registers with no partial register reads or writes.
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
NOSTACKFRAME
count_args proc
; -------------------------------
; counts ";" delimiter and adds 1
; rcx = string address
; -------------------------------
cmp BYTE PTR [rcx], 0 ; exit with error on null string
je error
xor rax, rax ; set rax to 0
sub rcx, 1
@@:
add rcx, 1
movzx rdx, BYTE PTR [rcx] ; zero extend byte to rdx
test rdx, rdx ; test for terminator
jz @F ; exit loop on 0
cmp rdx, 59 ; test if delimiter
jnz @B ; loop back if not delimiter
add rax, 1 ; increment the arg count
jmp @B ; loop back
@@:
add rax, 1 ; return delimiter count + 1
ret
error:
xor rax, rax ; exit on empty string with rax = 0
ret
count_args endp
STACKFRAME
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
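The single-read-per-iteration structure of count_args translates to C roughly as follows (count_args_c is an illustrative name, not the library routine). Note the byte is loaded once and then tested twice, matching the movzx version above:

```c
#include <stddef.h>

/* Mirror of count_args: load each byte once, then test it for
   the terminator and for the ';' delimiter. */
static size_t count_args_c(const char *s)
{
    if (*s == '\0')
        return 0;                          /* empty string -> 0 */
    size_t count = 0;
    for (;;) {
        unsigned char c = (unsigned char)*s++;  /* one read per iteration */
        if (c == '\0')
            break;                         /* terminator ends the scan */
        if (c == ';')
            count++;                       /* delimiter found */
    }
    return count + 1;                      /* args = delimiters + 1 */
}
```

A C compiler would generate much the same shape: one load into a register, two compares against immediates.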
xor eax,eax
cmp [rcx],al
jz done
@@:
mov dl,[rcx]
test dl,dl
jz @F
inc rcx
cmp dl,";"
jnz @B
inc eax
jmp @B
@@:
inc eax
done:
ret
Edited: to return 0 on an empty string.
There is only one integer register size on a 64-bit processor: 64 bits, which for the accumulator is RAX. You get EAX, AX and AL through masking of RAX. I avoid this by sticking with the native register size, which in this context is 64 bits. Write an algo of either type, put them into an app, then look at the padding after either proc: your gain in size reduction is wasted.
The default operand size is still 32 bits, so using 64 bits unnecessarily imposes a penalty, not in speed but in size.
They even made it easy for us: "xor rax,rax" is exactly the same as "xor eax,eax" but needs the REX prefix byte.
In the context of a 10,000 line program it can make a big difference.
I just think it's a bad habit to get into. Don't get blinded by "64 bits".
:biggrin:
Relying on a one-trick pony to port 32-bit to 64-bit is itself a risky path. XOR works the same with EAX, but not with AX or AL. The only gain you can get is a reduction in the size of some instructions, but as I have commented before, in most instances you lose that to the alignment padding after each procedure. With normal 16-byte alignment you have to save enough bytes, at whatever cost, to avoid the next 16-byte boundary, and unless the procedure is long enough to do that you gain nothing.
64-bit code does end up a bit larger, but nothing like twice the size of comparable 32-bit code, and it has the great advantages of many more integer registers and far more memory than the effective 2 gig single-allocation limit of 32-bit. From memory, a Win10 64-bit box with enough memory can address up to 128 gig. I routinely test on 32 gig, and you would be surprised just how slow traditional 32-bit design is on memory that big.
I am still fascinated that the pre-i486 style of code lingers on. The days of real mode pre-286 are long over, and while it did make some difference with DOS COM files if you were up near the 64k limit, it's a waste of time on anything from the i486 upwards. I remember some folks waxing lyrical about using short jumps to save space, but it never went any faster.
hutch—
When do you decide to rename your environment to masm64? Let it be version 1. Everybody really needs it. Best regards, Alex
I got your last "reminder", but it's the same problem: not only do I NOT own the domain name, it would also involve rewriting all of the source code, as it intentionally uses hard coded paths to avoid picking up the incorrect binaries, libraries and the like.
Quote from: Alex81524 on March 26, 2019, 08:31:46 PM
hutch—
When do you decide to rename your environment to masm64? Let it be version 1. Everybody really needs it. Best regards, Alex
I am waiting as well for Microsoft to rename the System32 folder to System64 and update kernel32.dll to kernel64.dll. Until then we will never have a real 64-bit operating system. :lol:
Quote from: hutch-- on March 26, 2019, 11:01:17 PM
as it intentionally uses hard coded paths to avoid picking up the incorrect binaries, libraries and the like.
Given the incredible mess with "flexible" paths in other languages, I hereby suggest Hutch for the Nobel Prize in Relaxed Programming 8)
Google
"C++" "path" "not found": About 1,600,000 results :bgrin: