StrLen for Masm64 SDK

jj2007 · September 08, 2023, 07:38:38 AM

Can I have some timings, please? Pure Masm64 SDK code attached.

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz
10 mega iterations, 10240 instructions
1080 megacycles for szLen (Masm64 SDK)
269 megacycles for szLenSSE

1067 megacycles for szLen (Masm64 SDK)
268 megacycles for szLenSSE

1087 megacycles for szLen (Masm64 SDK)
268 megacycles for szLenSSE

1111 megacycles for szLen (Masm64 SDK)
269 megacycles for szLenSSE

1073 megacycles for szLen (Masm64 SDK)
260 megacycles for szLenSSE

63 bytes for szLen
31 bytes for szLenSSE

jj2007 · September 08, 2023, 08:45:51 AM

Only a factor 3 for my AMD:

AMD Athlon Gold 3150U with Radeon Graphics
10 mega iterations, 10240 instructions
1088 megacycles for szLen (Masm64 SDK)
354 megacycles for szLenSSE

1084 megacycles for szLen (Masm64 SDK)
365 megacycles for szLenSSE

1102 megacycles for szLen (Masm64 SDK)
330 megacycles for szLenSSE

1035 megacycles for szLen (Masm64 SDK)
323 megacycles for szLenSSE

1106 megacycles for szLen (Masm64 SDK)
321 megacycles for szLenSSE

63 bytes for szLen
31 bytes for szLenSSE

lingo · September 08, 2023, 10:56:18 AM

Again a useless test between a slow and the slowest StrLen algo.

szLenSSE proc                   ; 31 bytes; JJ 7 Sept 23
        xor      eax, eax
        xorps    xmm1, xmm1
@@:     movups   xmm0, [rcx+rax]
        pcmpeqb  xmm0, xmm1     ; any nullbytes in there?
        add      eax, 16
        pmovmskb edx, xmm0      ; show them in edx
        bsf      edx, edx       ; get index of first nullbyte
        je       @B
        lea      eax, [edx+eax-16]      ; return len(rcx)
        ret
szLenSSE endp
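
For readers who do not read SSE2 fluently, the loop above can be modelled in scalar code. The following Python sketch is only an illustration and not part of the SDK: `zero_mask16`, `bsf` and `strlen_chunks16` are hypothetical names, a bytes object stands in for memory, and each 16-byte slice plays the role of one movups load.

```python
def zero_mask16(chunk: bytes) -> int:
    """Stand-in for pcmpeqb + pmovmskb: bit i is set when chunk[i] is 0."""
    mask = 0
    for i, value in enumerate(chunk):
        if value == 0:
            mask |= 1 << i
    return mask


def bsf(mask: int) -> int:
    """Index of the lowest set bit; the caller guarantees mask != 0."""
    return (mask & -mask).bit_length() - 1


def strlen_chunks16(buf: bytes) -> int:
    """Walk buf in 16-byte chunks until one of them contains a zero byte."""
    offset = 0
    while True:
        chunk = buf[offset:offset + 16]       # movups xmm0, [rcx+rax]
        if not chunk:
            raise ValueError("no terminating zero byte found")
        mask = zero_mask16(chunk)
        offset += 16                          # add eax, 16
        if mask:                              # the asm loops while bsf reports "no bit set"
            return offset - 16 + bsf(mask)    # lea eax, [edx+eax-16]


print(strlen_chunks16(b"Masm64 SDK\0"))       # prints 10
```

The model also makes one property of the asm visible: a whole 16-byte chunk is loaded before the mask is examined, so the real routine can read up to 15 bytes beyond the terminator.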

Notes:
1. Only the slow movups instruction is used (no attempt to use movdqa or movaps in the algo)
2. movups xmm0, [rcx+rax] is slower than movups xmm0, [rax]
3. Two instructions are used to get the result of the compare: I. movups xmm0, [rcx+rax] II. pcmpeqb xmm0, xmm1
4. Very bad!!

The very slow instruction bsf edx, edx is included in the loop!??

My StrLen algo will win easily.

Align 16
db 7 dup (90h) 
StrLenA_Lingo		        proc
				movdqu	    xmm1, [rax]
				pxor	    xmm0, xmm0
				pcmpeqb     xmm1, xmm0
				pmovmskb    ecx,  xmm1	
				test	    ecx,  ecx
				jne	    @L_End
				mov	    edx,  eax
				and	    rax,  -16
@@:
				pcmpeqb     xmm0, [rax+16]
				pcmpeqb     xmm1, [rax+32]
				por	    xmm1, xmm0		
				pmovmskb    ecx,  xmm1	
				test	    ecx,  ecx
				lea	    eax,  [eax+32]		
				jz	    @b
				shl	    ecx,  16	
				sub	    eax,  edx
				pmovmskb    edx,  xmm0
				add	    ecx,  edx	
				bsf	    edx,  ecx
				lea	    eax,  [eax+edx-16]
				ret
@L_End:
				bsf	    eax,  ecx
				ret
StrLenA_Lingo		        endp
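
The routine above checks one unaligned 16-byte block first, rounds the pointer down to a 16-byte boundary, and then tests two aligned blocks per iteration, merging both byte masks into one 32-bit value before a single bsf. A rough scalar model of that idea, in Python, might look like the sketch below. The names `zero_mask16`, `bsf` and `strlen_paired` are hypothetical, `mem` stands in for memory and `start` for the string pointer; it models the logic only and says nothing about speed.

```python
def zero_mask16(chunk: bytes) -> int:
    """Stand-in for pcmpeqb + pmovmskb: bit i is set when chunk[i] is 0."""
    mask = 0
    for i, value in enumerate(chunk):
        if value == 0:
            mask |= 1 << i
    return mask


def bsf(mask: int) -> int:
    """Index of the lowest set bit; the caller guarantees mask != 0."""
    return (mask & -mask).bit_length() - 1


def strlen_paired(mem: bytes, start: int) -> int:
    """Length of the zero-terminated string that begins at mem[start]."""
    mask = zero_mask16(mem[start:start + 16])        # movdqu: one unaligned probe
    if mask:
        return bsf(mask)                             # @L_End: bsf eax, ecx
    base = start & -16                               # and rax, -16
    while base < len(mem):
        m0 = zero_mask16(mem[base + 16:base + 32])   # pcmpeqb xmm0, [rax+16]
        m1 = zero_mask16(mem[base + 32:base + 48])   # pcmpeqb xmm1, [rax+32]
        base += 32                                   # lea eax, [eax+32]
        if m0 | m1:
            # shl ecx, 16 / add ecx, edx: the first block's bits land in the
            # low half, so bsf prefers a zero byte found in the first block.
            combined = ((m0 | m1) << 16) + m0
            return (base - start) + bsf(combined) - 16
    raise ValueError("no terminating zero byte found")


print(strlen_paired(b"xxx" + b"Masm64 SDK\0", 3))    # prints 10
```

Because the aligned blocks start at `base + 16`, which is always beyond `start`, bytes in front of the string are never examined; bytes already covered by the first probe may simply be checked a second time.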

Biterider,
Pls, keep the good traditions of the forum and don't allow such slow lamer's code to enter the Masm64 SDKs!

jj2007 · September 08, 2023, 06:27:28 PM

Quote
Align 16
db 7 dup (90h)
StrLenA_Lingo proc
    movdqu   xmm1, [rax]
    pxor     xmm0, xmm0
    pcmpeqb  xmm1, xmm0
    pmovmskb ecx, xmm1
    test     ecx, ecx
    jne      @L_End
    mov      edx, eax
    and      rax, -16
@@:
    pcmpeqb  xmm0, [rax+16]
    pcmpeqb  xmm1, [rax+32]
    por      xmm1, xmm0
    pmovmskb ecx, xmm1
    test     ecx, ecx
    lea      eax, [eax+32]
    jz       @b
    shl      ecx, 16
    sub      eax, edx
    pmovmskb edx, xmm0
    add      ecx, edx
    bsf      edx, ecx
    lea      eax, [eax+edx-16]
    ret
@L_End:
    bsf      eax, ecx
    ret
StrLenA_Lingo endp
Biterider,
Pls, keep the good traditions of the forum and don't allow such slow lamer's code to enter the Masm64 SDKs!

I can echo that, forum traditions are important

Attached the Masm64 SDK source with Lingo's addition.

P.S.: I've permitted myself to add a mov rax, rcx under "proc" because, y'know, in 64-bit land the one and only string pointer is passed in rcx, not rax. There is another tiny modification... find out yourself.

Timings for a 100-byte string:

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz
10 mega iterations, 1024 instructions
929 megacycles for szLen (Masm64 SDK)
148 megacycles for szLenSSE
137 megacycles for StrLenA_Lingo

930 megacycles for szLen (Masm64 SDK)
145 megacycles for szLenSSE
144 megacycles for StrLenA_Lingo

917 megacycles for szLen (Masm64 SDK)
147 megacycles for szLenSSE
134 megacycles for StrLenA_Lingo

926 megacycles for szLen (Masm64 SDK)
152 megacycles for szLenSSE
134 megacycles for StrLenA_Lingo

915 megacycles for szLen (Masm64 SDK)
138 megacycles for szLenSSE
134 megacycles for StrLenA_Lingo

63 bytes for szLen
31 bytes for szLenSSE
86 bytes for StrLenA_Lingo

Timings for a 30-byte string (frequent case when parsing sources):

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz
10 mega iterations, 1024 instructions
301 megacycles for szLen (Masm64 SDK)
63 megacycles for szLenSSE
75 megacycles for StrLenA_Lingo

308 megacycles for szLen (Masm64 SDK)
64 megacycles for szLenSSE
76 megacycles for StrLenA_Lingo

306 megacycles for szLen (Masm64 SDK)
63 megacycles for szLenSSE
78 megacycles for StrLenA_Lingo

311 megacycles for szLen (Masm64 SDK)
59 megacycles for szLenSSE
76 megacycles for StrLenA_Lingo

302 megacycles for szLen (Masm64 SDK)
63 megacycles for szLenSSE
76 megacycles for StrLenA_Lingo

63 bytes for szLen
31 bytes for szLenSSE
86 bytes for StrLenA_Lingo
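
To compare the two SSE routines across both string lengths, the five runs of each test can be reduced to medians. The short Python sketch below does only that arithmetic on the figures posted above; the variable and function names are illustrative.

```python
from statistics import median

# Megacycle figures copied from the two i5-2450M runs posted above.
runs = {
    "100-byte": {
        "szLen": [929, 930, 917, 926, 915],
        "szLenSSE": [148, 145, 147, 152, 138],
        "StrLenA_Lingo": [137, 144, 134, 134, 134],
    },
    "30-byte": {
        "szLen": [301, 308, 306, 311, 302],
        "szLenSSE": [63, 64, 63, 59, 63],
        "StrLenA_Lingo": [75, 76, 78, 76, 76],
    },
}


def lingo_vs_sse(timings):
    """Ratio of median cycles; below 1.0 means StrLenA_Lingo needed fewer cycles."""
    return median(timings["StrLenA_Lingo"]) / median(timings["szLenSSE"])


for label, timings in runs.items():
    print(f"{label}: StrLenA_Lingo / szLenSSE = {lingo_vs_sse(timings):.2f}")
# 100-byte: StrLenA_Lingo / szLenSSE = 0.91
# 30-byte: StrLenA_Lingo / szLenSSE = 1.21
```

On these numbers StrLenA_Lingo needs roughly 9% fewer cycles for the 100-byte string and roughly 21% more for the 30-byte string.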

Caché GB · September 08, 2023, 11:23:11 PM

History of the lszLen algo.

donkey, first version, June 12, 2005
https://www.masmforum.com/board/index.php?msg=15072

qWord (xmm-version) December 22 2008
https://www.masmforum.com/board/index.php?msg=77253

NightWare improves xmm-version December 22, 2008
https://www.masmforum.com/board/index.php?msg=77305

lingo speeds things up, March 06, 2009
https://www.masmforum.com/board/index.php?msg=80884

Any corrections?

jj2007 · September 09, 2023, 12:00:01 AM

Quote from: lingo on September 08, 2023, 10:33:18 PM
From the well-known to the unknown, fake jj "tests" and "news", misinformation and hate rhetoric are causing harm to many new asm programmers.

I suppose you mean this by "hate rhetoric":

Quote from: jj2007 on September 08, 2023, 06:27:28 PM
Attached the Masm64 SDK source with Lingo's addition.

P.S.: I've permitted myself to add a mov rax, rcx under "proc" because, y'know, in 64-bit land the one and only string pointer is passed in rcx, not rax. There is another tiny modification... find out yourself.

Sorry if that offended you. However, you posted code that obviously was not correct - it crashed. I admire your capacity to write code without testing it, but it's an old forum tradition to test code before posting it in The Lab.

Caché GB · September 09, 2023, 12:18:44 AM

I know someone who definitely posted code that is obviously not correct

https://masm32.com/board/index.php?msg=123445

32 bit code in 64 bit subforum and on top of that will crash.

StdOutExtreme proc lpszText:DWORD
  pop eax
  pop ecx
  push eax

  ;------------------------  WriteFile will clear this part of the stack
  push NULL
  push esp
  push rv(szLen, ecx)
  push ecx
  push rv(GetStdHandle, STD_OUTPUT_HANDLE)
  call WriteFile

That leaves two pops and one push.
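
Whether two pops and one push are a problem depends on what is on the stack at entry. The Python sketch below is a toy model of the lines shown, under loudly stated assumptions: the code is 32-bit stdcall, no assembler-generated prologue runs before `pop eax`, the `rv(...)` macro calls leave the stack balanced, and WriteFile (stdcall on 32-bit Windows) removes its own five arguments. The rest of the procedure is not shown in the post and is not modelled; the function name is hypothetical.

```python
def simulate_entry_sequence():
    """Track stack contents for the instructions shown; the list end is the stack top."""
    stack = ["lpszText", "return address"]   # state right after the caller's call
    regs = {}

    regs["eax"] = stack.pop()                # pop eax  -> return address
    regs["ecx"] = stack.pop()                # pop ecx  -> lpszText
    stack.append(regs["eax"])                # push eax -> return address back on top

    # Five WriteFile arguments, pushed right to left.
    stack.append("lpOverlapped = NULL")                      # push NULL
    stack.append("lpNumberOfBytesWritten = old esp")         # push esp (points at the NULL slot)
    stack.append("nNumberOfBytesToWrite = szLen result")     # push rv(szLen, ecx)
    stack.append("lpBuffer = ecx")                           # push ecx
    stack.append("hFile = GetStdHandle result")              # push rv(GetStdHandle, ...)

    # Assumption: the stdcall callee pops its own five arguments on return.
    del stack[-5:]
    return stack, regs


stack, regs = simulate_entry_sequence()
print(stack)   # prints ['return address']
```

Under these assumptions only the return address remains after the call, and the argument has already been removed from the stack, which is what a stdcall callee is expected to do before returning.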

jj2007 · September 09, 2023, 12:23:26 AM

Quote from: Caché GB on September 09, 2023, 12:18:44 AM
32 bit code in 64 bit subforum and on top of that will crash.

I am very sorry that you don't understand my code. Are you the person who downloaded StdOutJJ.zip? If yes, did it crash?

Caché GB · September 09, 2023, 12:54:36 AM

Hi JJ
No, I did not run your code. I'll look into it, although it's definitely 32-bit (push eax will not assemble in x64).
Then again, maybe I'm wrong.

jj2007 · September 09, 2023, 01:06:15 AM

Quote from: Caché GB on September 09, 2023, 12:54:36 AM
Hi JJ
No, I did not run your code. I'll look into it, although it's definitely 32-bit (push eax will not assemble in x64).

No problem. The stack manipulations are not so easy to understand, I know.

Caché GB · September 09, 2023, 01:24:08 AM

Hi JJ
Yes, stack manipulation is one of those things that can stay in one's blind spot for days, even weeks.
Thanks.

Greenhorn · September 09, 2023, 02:14:45 AM

From the AMD Manual:

Quote
PUSH Push onto Stack

Decrements the stack pointer and then copies the specified immediate value or the value in the
specified register or memory location to the top of the stack (the memory location pointed to by
SS:rSP).

The operand-size attribute determines the number of bytes pushed to the stack. The stack-size attribute determines whether SP, ESP, or RSP is the stack pointer. The address-size attribute is used only to locate the memory operand when pushing a memory operand to the stack.

If the instruction pushes the stack pointer (rSP), the resulting value on the stack is that of rSP before execution of the instruction. There is a PUSH CS instruction but no corresponding POP CS. The RET (Far) instruction pops a value from the top of stack into the CS register as part of its operation.

In 64-bit mode, the operand size of all PUSH instructions defaults to 64 bits, and there is no prefix available to encode a 32-bit operand size. Using the PUSH CS, PUSH DS, PUSH ES, or PUSH SS instructions in 64-bit mode generates an invalid-opcode exception.

Pushing an odd number of 16-bit operands when the stack address-size attribute is 32 results in a misaligned stack pointer.

Caché GB · September 09, 2023, 02:37:00 AM

Hi Greenhorn
So

           push  al
            pop  ah  ; <- does not assemble, error A2149: byte register cannot be first operand
 
           push  ax
            pop  ax  ; <- assembles

           push  eax
            pop  eax  ; <- does not assemble,  error A2070: invalid instruction operands

           push  rax
            pop  rax  ; <- assembles

lingo · September 09, 2023, 05:22:35 AM

Hi jj,
pls, try to explain my notes to stoo23, so that he understands that I don't agree with your code and testing results but do not criticize you personally, because I know you from a very long time ago...

27% faster with ease!

Intel(R) Core(TM) i7-6700K CPU @ 4.33GHz

10 mega iterations, 10240 instructions

214 megacycles for StrLenA (Lingo)
293 megacycles for szLenSSE

212 megacycles for StrLenA (Lingo)
292 megacycles for szLenSSE

213 megacycles for StrLenA (Lingo)
297 megacycles for szLenSSE

212 megacycles for StrLenA (Lingo)
291 megacycles for szLenSSE

214 megacycles for StrLenA (Lingo)
294 megacycles for szLenSSE

78 bytes for StrLenA
31 bytes for szLenSSE
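
The 27% figure can be checked against the numbers in this run. A short Python sketch of the arithmetic, using medians of the five results above (variable names are illustrative):

```python
from statistics import median

# Megacycle figures copied from the i7-6700K run above.
lingo_cycles = [214, 212, 213, 212, 214]
sse_cycles = [293, 292, 297, 291, 294]

m_lingo = median(lingo_cycles)   # 213
m_sse = median(sse_cycles)       # 293

fewer_cycles = (m_sse - m_lingo) / m_sse   # share of cycles saved
throughput = m_sse / m_lingo               # how many times as fast

print(f"{fewer_cycles:.1%} fewer cycles, {throughput:.2f}x the throughput")
# prints: 27.3% fewer cycles, 1.38x the throughput
```

So "27%" matches the reduction in cycles for this 10240-instruction test; expressed as a speed ratio, the same data gives about 1.38.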

stoo23 · September 09, 2023, 05:43:46 AM

Quote
Hi jj,
pls, try to explain to stoo23 my notes: to understand that i don't agree with your code and testing results but do not criticize you personally because I know you from very long time ago...

Hey, Why can't You explain it to me yourself ???

I care not how long you have known him, your next post was simply Not acceptable here in the Open Forum.
I asked you to edit it and you have not, so, as suggested I will OK?

I'm telling you Personally, not through an 'Agent', although I could also ask JJ to "Tell You".
My apologies to the rest of you
