I'm not sure when it is better to use a MACRO instead
of a PROC, and vice versa.
Let's assume a MACRO has some code inside, about 20 instructions,
and that MACRO is used in 10 different places during the program
execution.
Does this mean the code is repeated 10 times? Or what?
Is it better in this case to use a PROC in order to shrink program
size, but with a little overhead due to the CALL mechanism?
And in general when would you use a MACRO, and why, vs a PROC?
Thanks and happy new year.
Frank
Macros are text replacement, thus each call to a code-producing macro will generate code.
There is no general rule for when to inline code or not - that is your decision. Commonly you have to balance code size against speed (the larger the code gets, the smaller the advantage of inlining becomes). Also note that the code size of the whole program isn't as important today as it was in the past.
macros are intended to save you some typing - and they do
many macros call functions, so don't think that the 2 are mutually exclusive
a good example is the "print" macro
the real advantage of that macro is the flexibility - not that it is fast or that it saves bytes
take a look at the print macro in macros.asm :t
it seems very simple, but it may be used many ways
but, choosing whether to implement some code as a macro or as a proc may be a speed-vs-size decision
nidud....
push  edi
sub   eax, eax      ; al = 0, the byte to scan for
mov   edi, string
or    ecx, -1       ; ecx = -1, maximum count
repne scasb         ; scan until the terminating zero is found
sub   eax, 2        ; eax = -2
pop   edi
sub   eax, ecx      ; length = -2 - ecx
:P
So if I need the smallest code, it is better to use a PROC
called from many places. If I need a fast and flexible
solution, a MACRO has to be considered.
Thanks friends, this is a good starting point.
Hi frktons,
Some tasks are achieved at assembly time with macros. There are a lot of examples in the \masm32\macros folder.
And the speed vs size tradeoff is not always so obvious:
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
loop overhead is approx. 17/10 cycles
4448 cycles for 10 * JJ
4445 cycles for 10 * Dave
1398 cycles for 10 * crt_strlen
360 cycles for 10 * MasmBasic Len
4433 cycles for 10 * JJ
4435 cycles for 10 * Dave
1411 cycles for 10 * crt_strlen
362 cycles for 10 * MasmBasic Len
4429 cycles for 10 * JJ
4423 cycles for 10 * Dave
1373 cycles for 10 * crt_strlen
362 cycles for 10 * MasmBasic Len
15 bytes for JJ
16 bytes for Dave
11 bytes for crt_strlen
7 bytes for MasmBasic Len
100 = eax JJ
100 = eax Dave
100 = eax crt_strlen
100 = eax MasmBasic Len
i know this is a bit off-topic :P
Jochen - i wonder how they'd compare on short strings, say, 10 bytes in length
Quote from: dedndave on January 01, 2013, 03:16:10 AM
Jochen - i wonder how they'd compare on short strings, say, 10 bytes in length
line 281: mov Src100[10], 0
4430 cycles for 10 * JJ
4427 cycles for 10 * Dave
1432 cycles for 10 * crt_strlen
360 cycles for 10 * MasmBasic Len
900 cycles for 10 * JJ
890 cycles for 10 * Dave
207 cycles for 10 * crt_strlen
154 cycles for 10 * MasmBasic Len
a bit surprising :P
Quote from: dedndave on January 01, 2013, 03:56:04 AM
a bit surprising :P
The first rows are for 100 bytes, the next ones for 10 bytes. There is a bit of overhead, so I don't find it that surprising; repne scasb is not the fastest, so I would not use it in an innermost loop with a high count...
a little playing around...
if i have a string length of 0, i get about 88 cycles
if i comment out the REPNZ SCASB, i get about 1 or 2 cycles
well - that's a little off - we can figure about 7 or 8 cycles for the overhead code
point is - that first SCASB takes ~80 cycles
after that, it's about 4 cycles per rep on shorter strings
i suspect that's because it has to test the direction flag
you may remember our previous discussions about STD and CLD taking ~80 cycles
i think the processor is slow with the DF because of protected mode
maybe we could test that theory by booting up in real mode
not that there is anything to be gained by it - lol
i wonder if things would be different in ring 0
seems to me that CPUID takes about 80 cycles, too
give you any ideas, Michael ? :biggrin:
Frank,
Macros are a convenience, but then much of programming IS convenience, and macros are more powerful than that: if you want to, you can put procs in macros just as easily as you can put macro calls in procs. qword is correct here in that code size almost does not matter; in assembler it is generally small enough anyway, and often the minuscule gains you get are wasted on the final link-controlled section alignment. Chasing code size is a leftover from DOS COM files: in 64k total it mattered, but with a 4 gig address space and up to 2 gig memory allocation in 32-bit code, it simply does not matter. Write good clear fast code, and forget the old DOS stuff.
Quote from: dedndave on January 01, 2013, 05:48:48 AM
seems to me that CPUID takes about 80 cycles, too
Even more, it seems: 174 cycles on my Celeron.
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
loop overhead is approx. 188/100 cycles
44097 cycles for 100 * JJ
43977 cycles for 100 * Dave
15090 cycles for 100 * crt_strlen
3678 cycles for 100 * MasmBasic Len #### round 1, 100*100 bytes ####
17462 cycles for 100 * CPUID
5928 cycles for 100 * JJ
5828 cycles for 100 * Dave
1376 cycles for 100 * crt_strlen
1553 cycles for 100 * MasmBasic Len #### round 2, 100*2 bytes ####
17459 cycles for 100 * CPUID
In my test it depended on the CPUID function number in EAX, with the higher numbers requiring more cycles, up through 3 IIRC where it leveled out.
Interesting data, and intriguing strlen:
#if defined(_MSC_VER) && (_MSC_VER != 1400 || !defined(ppc)) && \
    (_MSC_VER != 1310) && (_MSC_VER != 1200) && defined(_UNICODE)
#pragma function(wcslen)
#endif
/* Calculate the length of the string pointed to by pStr (excluding the
 * terminating '\0')
 */
size_t _tcslen(const _TCHAR *pStr)
{
    const _TCHAR *pEnd;

    for (pEnd = pStr; *pEnd != _TEXT('\0'); pEnd++)
        continue;

    return pEnd - pStr;
}
©2004 Microsoft Corporation. All rights reserved.
Being about 7:1 faster than repne scasb means it is well
optimized by the C compiler, and I can't figure out how.
Quote from: frktons on January 02, 2013, 10:13:28 AM
Interesting data, and intriguing strlen:
Being about 7:1 faster than repne scasb means it is well optimized by the C compiler. And I can't figure it How?
CRT strlen is only about 3 times faster than repne scasb, and it's no mystery (ecx=pString):
77C178C0 8B01 mov eax, [ecx]
77C178C2 BA FFFEFE7E mov edx, 7EFEFEFF
77C178C7 03D0 add edx, eax
77C178C9 83F0 FF xor eax, FFFFFFFF
77C178CC 33C2 xor eax, edx
77C178CE 83C1 04 add ecx, 4
77C178D1 A9 00010181 test eax, 81010100
77C178D6 74 E8 je short 77C178C0
77C178D8 8B41 FC mov eax, [ecx-4]
77C178DB 84C0 test al, al
77C178DD 74 32 je short 77C17911
77C178DF 84E4 test ah, ah
77C178E1 74 24 je short 77C17907
77C178E3 A9 0000FF00 test eax, 00FF0000
77C178E8 74 13 je short 77C178FD
77C178EA A9 000000FF test eax, FF000000
77C178EF 74 02 je short 77C178F3
77C178F1 EB CD jmp short 77C178C0
77C178F3 8D41 FF lea eax, [ecx-1]
77C178F6 8B4C24 04 mov ecx, [esp+4]
77C178FA 2BC1 sub eax, ecx
77C178FC C3 retn
I was talking about the performance on my pc:
Quote
Intel(R) Core(TM)2 CPU E6600 @ 2.40GHz (SSE4)
loop overhead is approx. 17/10 cycles
6617 cycles for 10 * JJ
6607 cycles for 10 * Dave
937 cycles for 10 * crt_strlen
221 cycles for 10 * MasmBasic Len
6602 cycles for 10 * JJ
6619 cycles for 10 * Dave
852 cycles for 10 * crt_strlen
220 cycles for 10 * MasmBasic Len
6611 cycles for 10 * JJ
6605 cycles for 10 * Dave
866 cycles for 10 * crt_strlen
220 cycles for 10 * MasmBasic Len
You are probably using SSE2 code to parallelize the check
for the zero byte, while strlen probably uses a four-bytes-at-a-time
approach, the trick of the holes.
mov edx, 7EFEFEFF
Of course, if you have better code inside a MACRO it'll be faster
than its counterpart in a PROC that uses a single-byte approach. :t