MACROs and PROCs - when to use?

Started by frktons, January 01, 2013, 12:35:16 AM

MichaelW

In my test it depended on the CPUID function number in EAX, with the higher numbers requiring more cycles, up through 3 IIRC where it leveled out.
Well Microsoft, here's another nice mess you've gotten us into.

frktons

Interesting data, and intriguing strlen:

#if defined(_MSC_VER) && (_MSC_VER != 1400 || !defined(ppc)) && (_MSC_VER != 1310) && \
    (_MSC_VER != 1200) && defined(_UNICODE)
#pragma function(wcslen)
#endif

/* Calculate the length of the string pointed by pStr (excluding the
* terminating '\0')
*/
size_t _tcslen(const _TCHAR *pStr)
{
    const _TCHAR *pEnd;

    for (pEnd = pStr; *pEnd != _TEXT('\0'); pEnd++)
        continue;

    return pEnd - pStr;
}
©2004 Microsoft Corporation. All rights reserved.

Being about 7:1 faster than repne scasb means it is well
optimized by the C compiler, and I can't figure out how.
There are only two days a year when you can't do anything: one is called yesterday, the other is called tomorrow, so today is the right day to love, believe, do and, above all, live.

Dalai Lama

jj2007

Quote from: frktons on January 02, 2013, 10:13:28 AM
Interesting data, and intriguing strlen:

Being about 7:1 faster than repne scasb means it is well optimized by the C compiler, and I can't figure out how.

CRT strlen is only about 3 times faster than repne scasb, and it's no mystery (ecx=pString):
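77C178C0        8B01                 mov eax, [ecx]          ; load 4 bytes of the string
77C178C2        BA FFFEFE7E          mov edx, 7EFEFEFF       ; magic constant with "holes" at bits 8, 16, 24, 31
77C178C7        03D0                 add edx, eax            ; a nonzero byte carries into the hole above it
77C178C9        83F0 FF              xor eax, FFFFFFFF       ; eax = not eax
77C178CC        33C2                 xor eax, edx            ; hole bits are set only where a carry was missing
77C178CE        83C1 04              add ecx, 4              ; advance to the next dword
77C178D1        A9 00010181          test eax, 81010100      ; look at the hole bits only
77C178D6        74 E8                je short 77C178C0       ; none set: no zero byte, keep looping
77C178D8        8B41 FC              mov eax, [ecx-4]        ; reload the dword to locate the zero
77C178DB        84C0                 test al, al
77C178DD        74 32                je short 77C17911       ; byte 0 is the terminator
77C178DF        84E4                 test ah, ah
77C178E1        74 24                je short 77C17907       ; byte 1 is the terminator
77C178E3        A9 0000FF00          test eax, 00FF0000
77C178E8        74 13                je short 77C178FD       ; byte 2 is the terminator
77C178EA        A9 000000FF          test eax, FF000000
77C178EF        74 02                je short 77C178F3       ; byte 3 is the terminator
77C178F1        EB CD                jmp short 77C178C0      ; false alarm (top byte was 80h), keep looping
77C178F3        8D41 FF              lea eax, [ecx-1]        ; address of the zero in byte 3
77C178F6        8B4C24 04            mov ecx, [esp+4]        ; original string pointer
77C178FA        2BC1                 sub eax, ecx            ; length = end - start
77C178FC        C3                   retn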

frktons

I was talking about the performance on my PC:
Quote
Intel(R) Core(TM)2 CPU  E6600  @ 2.40GHz (SSE4)
loop overhead is approx. 17/10 cycles

6617    cycles for 10 * JJ
6607    cycles for 10 * Dave
937     cycles for 10 * crt_strlen
221     cycles for 10 * MasmBasic Len

6602    cycles for 10 * JJ
6619    cycles for 10 * Dave
852     cycles for 10 * crt_strlen
220     cycles for 10 * MasmBasic Len

6611    cycles for 10 * JJ
6605    cycles for 10 * Dave
866     cycles for 10 * crt_strlen
220     cycles for 10 * MasmBasic Len

You are probably using SSE2 code to parallelize the check
for the zero byte, while strlen probably checks 4 bytes at a
time, using the trick of the holes.

mov edx, 7EFEFEFF

Of course, if a MACRO contains better code, it will be faster
than a PROC counterpart that uses a byte-at-a-time approach.
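
To make the SSE2 guess above concrete, here is a rough sketch in C with intrinsics of what a 16-bytes-at-a-time zero check could look like. This is only an illustration of the general pcmpeqb/pmovmskb technique, not the MasmBasic Len code, which is not shown in this thread. The name strlen_sse2 is made up, the sketch assumes an x86 compiler that provides emmintrin.h, and it assumes the buffer is padded so that each 16-byte read stays inside it. It uses unaligned loads to keep the example short; a tuned routine would normally align the pointer first.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2 intrinsics (x86 only) */
#include <stddef.h>

/* Hypothetical helper: string length, 16 bytes per step.
   ASSUMPTION: the buffer holding s is padded so that every 16-byte
   read starting at s, s+16, s+32, ... stays inside the buffer. */
size_t strlen_sse2(const char *s)
{
    assert(s != NULL);
    const __m128i zero = _mm_setzero_si128();
    const char *p = s;
    for (;;) {
        /* Load 16 bytes and compare each one with 0 (pcmpeqb). */
        __m128i chunk = _mm_loadu_si128((const __m128i *)p);
        __m128i eq    = _mm_cmpeq_epi8(chunk, zero);
        /* Collect one bit per byte (pmovmskb): bit i set if byte i is 0. */
        unsigned mask = (unsigned)_mm_movemask_epi8(eq);
        if (mask != 0) {
            /* Index of the lowest set bit is the position of the terminator. */
            unsigned i = 0;
            while ((mask & 1u) == 0) {
                mask >>= 1;
                i++;
            }
            return (size_t)(p + i - s);
        }
        p += 16;
    }
}
```

Compared with the 4-byte trick, one compare covers 16 bytes and there are no false alarms, which fits the gap between crt_strlen and MasmBasic Len in the timings, although the timings alone do not prove which technique Len uses.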