MACROs and PROCs - when to use?

MichaelW · January 02, 2013, 01:46:16 AM

In my test it depended on the CPUID function number in EAX, with the higher numbers requiring more cycles, up through 3 IIRC where it leveled out.

frktons · January 02, 2013, 10:13:28 AM

Interesting data, and intriguing strlen:

#if defined(_MSC_VER) && (_MSC_VER != 1400 || !defined(ppc)) && (_MSC_VER != 1310) && (_MSC_VER != 1200) && defined(_UNICODE)
#pragma function(wcslen)
#endif

/* Calculate the length of the string pointed by pStr (excluding the
 * terminating '\0')
 */
size_t _tcslen(const _TCHAR *pStr)
{
    const _TCHAR *pEnd;

    for (pEnd = pStr; *pEnd != _TEXT('\0'); pEnd++)
        continue;

    return pEnd - pStr;
}
©2004 Microsoft Corporation. All rights reserved.

Being about 7:1 faster than repne scasb means it is well
optimized by the C compiler, and I can't figure out how.

jj2007 · January 02, 2013, 11:16:22 AM

Quote from: frktons on January 02, 2013, 10:13:28 AM
Interesting data, and intriguing strlen:

Being about 7:1 faster than repne scasb means it is well optimized by the C compiler, and I can't figure out how.

CRT strlen is only about 3 times faster than repne scasb, and it's no mystery (ecx=pString):
77C178C0 8B01 mov eax, [ecx]
77C178C2 BA FFFEFE7E mov edx, 7EFEFEFF
77C178C7 03D0 add edx, eax
77C178C9 83F0 FF xor eax, FFFFFFFF
77C178CC 33C2 xor eax, edx
77C178CE 83C1 04 add ecx, 4
77C178D1 A9 00010181 test eax, 81010100
77C178D6 74 E8 je short 77C178C0
77C178D8 8B41 FC mov eax, [ecx-4]
77C178DB 84C0 test al, al
77C178DD 74 32 je short 77C17911
77C178DF 84E4 test ah, ah
77C178E1 74 24 je short 77C17907
77C178E3 A9 0000FF00 test eax, 00FF0000
77C178E8 74 13 je short 77C178FD
77C178EA A9 000000FF test eax, FF000000
77C178EF 74 02 je short 77C178F3
77C178F1 EB CD jmp short 77C178C0
77C178F3 8D41 FF lea eax, [ecx-1]
77C178F6 8B4C24 04 mov ecx, [esp+4]
77C178FA 2BC1 sub eax, ecx
77C178FC C3 retn
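
For readers who prefer C, here is a rough sketch of what that loop does. It is not the CRT source: the name word_strlen is made up, and the alignment prologue is an assumption (the listing starts at the main loop, so the real entry code is not shown). Reading an aligned 4-byte word can touch up to 3 bytes past the terminator, which works in practice because an aligned word never crosses a page, but is formally outside what the C standard promises.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical name; a C rendering of the word-at-a-time idea, not the CRT code. */
size_t word_strlen(const char *s)
{
    const char *p = s;

    /* Assumed prologue: go byte by byte until p is 4-byte aligned. */
    while (((uintptr_t)p & 3u) != 0) {
        if (*p == '\0')
            return (size_t)(p - s);
        p++;
    }

    for (;;) {
        uint32_t w;
        memcpy(&w, p, sizeof w);            /* mov eax, [ecx] */

        /* add edx, eax / xor eax, -1 / xor eax, edx / test eax, 81010100 */
        if ((((w + 0x7EFEFEFFu) ^ ~w) & 0x81010100u) != 0) {
            /* The word MAY hold a zero byte: confirm one byte at a time. */
            for (size_t i = 0; i < 4; i++) {
                if (p[i] == '\0')
                    return (size_t)(p - s) + i;
            }
            /* False positive (e.g. a 0x80 byte): fall through and go on,
               like the jmp short 77C178C0 at the end of the listing. */
        }
        p += 4;                             /* add ecx, 4 */
    }
}
```

The common case costs one load, one add, two xors and one test per 4 bytes, which is where the gain over a byte-at-a-time scan comes from.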

frktons · January 02, 2013, 11:31:31 AM

I was talking about the performance on my PC:

Quote
Intel(R) Core(TM)2 CPU E6600 @ 2.40GHz (SSE4)
loop overhead is approx. 17/10 cycles

6617 cycles for 10 * JJ
6607 cycles for 10 * Dave
937 cycles for 10 * crt_strlen
221 cycles for 10 * MasmBasic Len

6602 cycles for 10 * JJ
6619 cycles for 10 * Dave
852 cycles for 10 * crt_strlen
220 cycles for 10 * MasmBasic Len

6611 cycles for 10 * JJ
6605 cycles for 10 * Dave
866 cycles for 10 * crt_strlen
220 cycles for 10 * MasmBasic Len

You are probably using SSE2 code to parallelize the check
for the zero byte, while strlen probably uses a 4-bytes-at-a-time
approach, the trick of the holes:

 mov edx, 7EFEFEFF
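
To make the "holes" concrete: 7EFEFEFF has zero bits at positions 8, 16, 24 and 31. Adding it to a word makes every nonzero byte carry into the hole above it, so a hole that receives no carry points at a possible zero byte. A small C sketch of the test follows; hole_test is a made-up name, and the sample values are just for illustration.

```c
#include <stdint.h>

/* Hypothetical helper: a nonzero result means "w may contain a zero byte".
   Same arithmetic as add edx,eax / xor eax,-1 / xor eax,edx / test eax,81010100. */
uint32_t hole_test(uint32_t w)
{
    return ((w + 0x7EFEFEFFu) ^ ~w) & 0x81010100u;
}

/* hole_test(0x41414141) == 0x00000000   no zero byte, every hole got a carry
   hole_test(0x41410041) == 0x00010000   byte 1 is zero, bit 16 got no carry
   hole_test(0x80414141) == 0x80000000   false positive: top byte is 0x80     */
```

The last case is why the real code still checks the four bytes one by one after the test fires: the top hole is bit 31 itself, so a top byte of 0x80 looks the same as a zero byte there.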

Of course, if you have better code inside a MACRO, it'll be faster
than a counterpart in a PROC that uses a byte-at-a-time approach.

The MASM Forum
