StrLen for Masm64 SDK

Started by jj2007, September 08, 2023, 07:38:38 AM

jj2007

Can I have some timings, please? Pure Masm64 SDK code attached.

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz
10 mega iterations, 10240 instructions
1080 megacycles for szLen (Masm64 SDK)
269 megacycles for szLenSSE

1067 megacycles for szLen (Masm64 SDK)
268 megacycles for szLenSSE

1087 megacycles for szLen (Masm64 SDK)
268 megacycles for szLenSSE

1111 megacycles for szLen (Masm64 SDK)
269 megacycles for szLenSSE

1073 megacycles for szLen (Masm64 SDK)
260 megacycles for szLenSSE

63 bytes for szLen
31 bytes for szLenSSE

jj2007

Only a factor 3 for my AMD:
AMD Athlon Gold 3150U with Radeon Graphics
10 mega iterations, 10240 instructions
1088 megacycles for szLen (Masm64 SDK)
354 megacycles for szLenSSE

1084 megacycles for szLen (Masm64 SDK)
365 megacycles for szLenSSE

1102 megacycles for szLen (Masm64 SDK)
330 megacycles for szLenSSE

1035 megacycles for szLen (Masm64 SDK)
323 megacycles for szLenSSE

1106 megacycles for szLen (Masm64 SDK)
321 megacycles for szLenSSE

63 bytes for szLen
31 bytes for szLenSSE

lingo

Again a useless test between a slow and the slowest StrLen algo. :nie:

szLenSSE proc                 ; 31 bytes; JJ 7 Sept 23
        xor     eax, eax
        xorps   xmm1, xmm1
@@:     movups  xmm0, [rcx+rax]
        pcmpeqb xmm0, xmm1        ; any nullbytes in there?
        add     eax, 16
        pmovmskb edx, xmm0        ; show them in edx
        bsf     edx, edx          ; get index of first nullbyte
        je      @B
        lea     eax, [edx+eax-16] ; return len(rcx)
        ret
szLenSSE endp
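
For readers who want to follow the logic without an assembler, the loop above can be modelled in Python. This is only a sketch of the technique, not the SDK code: the names zero_byte_mask, lowest_set_bit and sz_len_sse_model are made up for illustration, and a bytes object stands in for memory. Where the real routine would simply read up to 15 bytes past the terminator, the model raises an error so that the over-read becomes visible.

```python
def zero_byte_mask(block: bytes) -> int:
    """Model pcmpeqb + pmovmskb: bit i is set when block[i] is a zero byte."""
    mask = 0
    for i, value in enumerate(block):
        if value == 0:
            mask |= 1 << i
    return mask


def lowest_set_bit(mask: int) -> int:
    """Model bsf for a non-zero mask: index of the least significant 1 bit."""
    return (mask & -mask).bit_length() - 1


def sz_len_sse_model(memory: bytes, start: int = 0) -> int:
    """Length of the zero-terminated string at memory[start:], 16 bytes per step."""
    offset = 0                                               # xor eax, eax
    while True:
        block = memory[start + offset:start + offset + 16]   # movups xmm0, [rcx+rax]
        if len(block) < 16:
            # The hardware would read these bytes anyway; the model refuses to.
            raise ValueError("16-byte read runs past the modelled memory")
        mask = zero_byte_mask(block)                         # pcmpeqb + pmovmskb
        offset += 16                                         # add eax, 16
        if mask:                                             # bsf clears ZF when mask != 0
            return lowest_set_bit(mask) + offset - 16        # lea eax, [edx+eax-16]
```

Calling sz_len_sse_model(b"A" * 30 + b"\0" + b"X" * 33) walks two 16-byte blocks and finds the terminator in the second one. Dropping the padding makes the second read fall off the end of the buffer, which is the situation where an unaligned 16-byte read could touch memory the string does not own.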

Notes:
1. Only the slow movups instruction is used (no attempt to include movdqa or movaps in the algo)
2. movups xmm0, [rcx+rax] is slower than movups xmm0, [rax]
3. Two instructions are used to get the result of the comparison: I. movups xmm0, [rcx+rax] II. pcmpeqb xmm0, xmm1
4. Very bad!!  :eusa_pray: The very slow instruction bsf edx, edx is included in the loop!??   :eusa_naughty:

My StrLen algo will win easily.

Align 16
db 7 dup (90h)
StrLenA_Lingo   proc
                movdqu      xmm1, [rax]
                pxor        xmm0, xmm0
                pcmpeqb     xmm1, xmm0
                pmovmskb    ecx,  xmm1
                test        ecx,  ecx
                jne         @L_End
                mov         edx,  eax
                and         rax,  -16
@@:
                pcmpeqb     xmm0, [rax+16]
                pcmpeqb     xmm1, [rax+32]
                por         xmm1, xmm0
                pmovmskb    ecx,  xmm1
                test        ecx,  ecx
                lea         eax,  [eax+32]
                jz          @b
                shl         ecx,  16
                sub         eax,  edx
                pmovmskb    edx,  xmm0
                add         ecx,  edx
                bsf         edx,  ecx
                lea         eax,  [eax+edx-16]
                ret
@L_End:
                bsf         eax,  ecx
                ret
StrLenA_Lingo   endp
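
The idea behind this listing (one unaligned probe of the first 16 bytes, then aligned reads of two 16-byte blocks per iteration, with the bit scan kept outside the loop) can be modelled in Python as follows. This is a sketch under stated assumptions, not a transliteration: the function names are hypothetical, list indices stand in for addresses, and the two masks are combined in a slightly simpler way than in the assembly. The model raises an error when a block read would leave the buffer; the real code may read up to 31 bytes past the terminator.

```python
def zero_byte_mask(block: bytes) -> int:
    """Model pcmpeqb + pmovmskb: bit i is set when block[i] is a zero byte."""
    return sum(1 << i for i, value in enumerate(block) if value == 0)


def lowest_set_bit(mask: int) -> int:
    """Model bsf for a non-zero mask: index of the least significant 1 bit."""
    return (mask & -mask).bit_length() - 1


def strlen_two_block_model(memory: bytes, start: int = 0) -> int:
    """Length of the zero-terminated string at address `start` (index == address)."""
    if start + 16 > len(memory):
        raise ValueError("first 16-byte read runs past the modelled memory")
    head = zero_byte_mask(memory[start:start + 16])    # movdqu + pcmpeqb + pmovmskb
    if head:
        return lowest_set_bit(head)                    # @L_End: bsf eax, ecx

    base = start & ~15                                 # and rax, -16
    while True:
        if base + 48 > len(memory):
            raise ValueError("aligned block read runs past the modelled memory")
        lo = zero_byte_mask(memory[base + 16:base + 32])   # pcmpeqb xmm0, [rax+16]
        hi = zero_byte_mask(memory[base + 32:base + 48])   # pcmpeqb xmm1, [rax+32]
        base += 32                                         # lea eax, [eax+32]
        if lo | hi:                                        # por + pmovmskb + test
            combined = (hi << 16) | lo                     # one 32-bit mask for both blocks
            # The low block started at base - 16; a single bit scan happens only here.
            return base - 16 + lowest_set_bit(combined) - start
```

Because the aligned blocks begin at the 16-byte boundary just above the rounded-down pointer, they overlap bytes already covered by the first probe; that is harmless, since those bytes are known not to contain the terminator.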

Biterider,
Please keep up the good traditions of the forum and don't allow such slow lamer's code to enter the Masm64 SDK! :nie:

Quid sit futurum cras fuge quaerere.

jj2007

Quote
Biterider,
Please keep up the good traditions of the forum and don't allow such slow lamer's code to enter the Masm64 SDK! :nie:

I can echo that, forum traditions are important :thumbsup:

Attached the Masm64 SDK source with Lingo's addition.

P.S.: I've permitted myself to add a mov rax, rcx under "proc" because, y'know, in 64-bit land the one and only string pointer is passed in rcx, not rax. There is another tiny modification... find out yourself.

Timings for a 100-byte string:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz
10 mega iterations, 1024 instructions
929 megacycles for szLen (Masm64 SDK)
148 megacycles for szLenSSE
137 megacycles for StrLenA_Lingo

930 megacycles for szLen (Masm64 SDK)
145 megacycles for szLenSSE
144 megacycles for StrLenA_Lingo

917 megacycles for szLen (Masm64 SDK)
147 megacycles for szLenSSE
134 megacycles for StrLenA_Lingo

926 megacycles for szLen (Masm64 SDK)
152 megacycles for szLenSSE
134 megacycles for StrLenA_Lingo

915 megacycles for szLen (Masm64 SDK)
138 megacycles for szLenSSE
134 megacycles for StrLenA_Lingo

63 bytes for szLen
31 bytes for szLenSSE
86 bytes for StrLenA_Lingo

Timings for a 30-byte string (frequent case when parsing sources):
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz
10 mega iterations, 1024 instructions
301 megacycles for szLen (Masm64 SDK)
63 megacycles for szLenSSE
75 megacycles for StrLenA_Lingo

308 megacycles for szLen (Masm64 SDK)
64 megacycles for szLenSSE
76 megacycles for StrLenA_Lingo

306 megacycles for szLen (Masm64 SDK)
63 megacycles for szLenSSE
78 megacycles for StrLenA_Lingo

311 megacycles for szLen (Masm64 SDK)
59 megacycles for szLenSSE
76 megacycles for StrLenA_Lingo

302 megacycles for szLen (Masm64 SDK)
63 megacycles for szLenSSE
76 megacycles for StrLenA_Lingo

63 bytes for szLen
31 bytes for szLenSSE
86 bytes for StrLenA_Lingo

Caché GB

History of the lszLen algo.

donkey, first version, June 12, 2005
https://www.masmforum.com/board/index.php?msg=15072

qWord (xmm version), December 22, 2008
https://www.masmforum.com/board/index.php?msg=77253

NightWare improves the xmm version, December 22, 2008
https://www.masmforum.com/board/index.php?msg=77305

lingo speeds things up, March 06, 2009
https://www.masmforum.com/board/index.php?msg=80884

Any corrections?
Caché GB's 1 and 0-nly language:MASM

jj2007

Quote from: lingo on September 08, 2023, 10:33:18 PM
From the well-known to the unknown, fake jj "tests" and "news", misinformation and hate rhetoric are causing harm to many new asm programmers.  :nie:

I suppose you mean this by "hate rhetoric":
Quote from: jj2007 on September 08, 2023, 06:27:28 PM
Attached the Masm64 SDK source with Lingo's addition.

P.S.: I've permitted myself to add a mov rax, rcx under "proc" because, y'know, in 64-bit land the one and only string pointer is passed in rcx, not rax. There is another tiny modification... find out yourself.

Sorry if that offended you. However, you posted code that obviously was not correct - it crashed. I admire your capacity to write code without testing it, but it's an old forum tradition to test code before posting it in The Lab.

Caché GB

I know someone who definitely posted code that is obviously not correct

https://masm32.com/board/index.php?msg=123445

32-bit code in a 64-bit subforum, and on top of that it will crash.

StdOutExtreme proc lpszText:DWORD
  pop eax
  pop ecx
  push eax

  ;------------------------  WriteFile will clear this part of the stack
  push NULL
  push esp
  push rv(szLen, ecx)
  push ecx
  push rv(GetStdHandle, STD_OUTPUT_HANDLE)
  call WriteFile

That leaves two pops and one push.

jj2007

Quote from: Caché GB on September 09, 2023, 12:18:44 AM
32 bit code in 64 bit subforum and on top of that will crash.

I am very sorry that you don't understand my code. Are you the person who downloaded StdOutJJ.zip? If yes, did it crash?

Caché GB

Hi JJ
No, I did not run your code. I'll look into it, although it's definitely 32-bit (push eax will not assemble in x64).
Again, maybe I am wrong.

jj2007

Quote from: Caché GB on September 09, 2023, 12:54:36 AM
Hi JJ
No I did not run your code. I'll look into it although its def 32 bit (push eax will not assemble in x64).

No problem. The stack manipulations are not so easy to understand, I know.
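
The stack handling under discussion can be replayed with a small Python model. It is a sketch under loud assumptions that the excerpt itself does not confirm: the procedure is taken to run without an assembler-generated prologue, the missing epilogue is taken to be a plain ret, and the function name and the placeholder values are hypothetical. Only the argument order of WriteFile (hFile, lpBuffer, nNumberOfBytesToWrite, lpNumberOfBytesWritten, lpOverlapped) and its stdcall clean-up are taken from the Win32 API.

```python
def simulate_std_out_extreme(return_address: int, text_pointer: int):
    """Replay the pushes and pops; the top of the stack is the end of the list."""
    # State on entry (32-bit stdcall): argument pushed first, return address on top.
    stack = [text_pointer, return_address]

    eax = stack.pop()          # pop eax  -> return address
    ecx = stack.pop()          # pop ecx  -> lpszText argument
    stack.append(eax)          # push eax -> return address back on top

    # Five pushes, right to left, for the WriteFile call.
    stack.append(0)                             # push NULL  (lpOverlapped)
    stack.append("address of the NULL slot")    # push esp   (lpNumberOfBytesWritten)
    stack.append("string length")               # push rv(szLen, ecx)
    stack.append(ecx)                           # push ecx   (lpBuffer)
    stack.append("stdout handle")               # push rv(GetStdHandle, ...)

    # WriteFile is stdcall, so the callee removes its five arguments on return.
    del stack[-5:]
    return stack, ecx
```

Under these assumptions only the return address is left on the stack after the call, so the two pops and one push have the same net effect as removing the single DWORD argument with ret 4.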


Caché GB

Hi JJ
Yes, stack manipulation is one of those things that can stay in one's blind spot for days, even weeks.
Thanks.

Greenhorn

From the AMD Manual:

Quote
PUSH     Push onto Stack

Decrements the stack pointer and then copies the specified immediate value or the value in the
specified register or memory location to the top of the stack (the memory location pointed to by
SS:rSP).

The operand-size attribute determines the number of bytes pushed to the stack. The stack-size attribute determines whether SP, ESP, or RSP is the stack pointer. The address-size attribute is used only to locate the memory operand when pushing a memory operand to the stack.

If the instruction pushes the stack pointer (rSP), the resulting value on the stack is that of rSP before execution of the instruction. There is a PUSH CS instruction but no corresponding POP CS. The RET (Far) instruction pops a value from the top of stack into the CS register as part of its operation.

In 64-bit mode, the operand size of all PUSH instructions defaults to 64 bits, and there is no prefix available to encode a 32-bit operand size. Using the PUSH CS, PUSH DS, PUSH ES, or PUSH SS instructions in 64-bit mode generates an invalid-opcode exception.

Pushing an odd number of 16-bit operands when the stack address-size attribute is 32 results in a misaligned stack pointer.
Kole Feut un Nordenwind gift en krusen Büdel un en lütten Pint.

Caché GB

Hi Greenhorn
So:
            push  al
            pop   ah   ; <- does not assemble, error A2149: byte register cannot be first operand

            push  ax
            pop   ax   ; <- assembles

            push  eax
            pop   eax  ; <- does not assemble, error A2070: invalid instruction operands

            push  rax
            pop   rax  ; <- assembles

lingo

Hi jj,
Please try to explain my notes to stoo23, so that he understands that I don't agree with your code and testing results but do not criticize you personally, because I have known you for a very long time...


27% faster with ease! :skrewy:

Intel(R) Core(TM) i7-6700K CPU @ 4.33GHz

10 mega iterations, 10240 instructions

214 megacycles for StrLenA (Lingo)
293 megacycles for szLenSSE

212 megacycles for StrLenA (Lingo)
292 megacycles for szLenSSE

213 megacycles for StrLenA (Lingo)
297 megacycles for szLenSSE

212 megacycles for StrLenA (Lingo)
291 megacycles for szLenSSE

214 megacycles for StrLenA (Lingo)
294 megacycles for szLenSSE

78 bytes for StrLenA
31 bytes for szLenSSE
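
How a percentage like this is read depends on the baseline. The short Python sketch below works through the first pair of timings from the listing above (214 and 293 megacycles); the two function names are hypothetical and only illustrate the arithmetic.

```python
def time_saved_percent(fast_cycles: float, slow_cycles: float) -> float:
    """Share of the slower routine's cycles that the faster routine saves."""
    return 100.0 * (slow_cycles - fast_cycles) / slow_cycles


def speedup_factor(fast_cycles: float, slow_cycles: float) -> float:
    """How many times faster the faster routine runs."""
    return slow_cycles / fast_cycles


saved = time_saved_percent(214, 293)   # roughly 27 percent fewer cycles
factor = speedup_factor(214, 293)      # roughly a factor of 1.37
```

Measured against szLenSSE, the run needs about 27 percent fewer cycles; expressed as a speed ratio, the same pair of numbers gives a factor of about 1.37.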


stoo23

Quote
Hi jj,
pls, try to explain to stoo23 my notes: to understand that i don't agree with your code and testing results but do not criticize you personally because I know you from very long time ago...

Hey, why can't you explain it to me yourself?

I don't care how long you have known him; your next post was simply not acceptable here in the open forum.
I asked you to edit it and you have not, so, as suggested, I will. OK?

I'm telling you Personally, not through an 'Agent', although I could also ask JJ to "Tell You".
My apologies to the rest of you  :sad: