Faster Memcopy ...

aw27 · November 19, 2019, 07:17:29 PM

Quote from: nidud on November 19, 2019, 07:12:37 AM
Added the Intel versions as well. They where originally 32-bit so the 64-bit version would have been coded differently from the direct translation as done here.

Lot of space there now: 64 * 32 regs = 2048 byte.

I think there is a bug in the ANSI version here:
args_x macro
lea rcx,str_1[size_s]
mov eax,step_x
add eax,eax <---------- HERE
sub rcx,rax
exitm<>
endm

Although it is not causing a buffer overflow, it changes clearly the results.

These are the results after removing the "add eax, eax"

Code Select


total [0 .. 40], 16++
   133475 cycles 5.asm: AVX 32
   152210 cycles 3.asm: SSE Intel Silvermont
   157375 cycles 1.asm: SSE 16
   172910 cycles 6.asm: AVX512 64
   178312 cycles 2.asm: SSE 32
   228359 cycles 0.asm: msvcrt.strlen()
   326672 cycles 4.asm: SSE Intel Atom
   
total [41 .. 80], 16++
   117358 cycles 5.asm: AVX 32
   123539 cycles 6.asm: AVX512 64
   165831 cycles 1.asm: SSE 16
   169369 cycles 3.asm: SSE Intel Silvermont
   210518 cycles 2.asm: SSE 32
   270514 cycles 0.asm: msvcrt.strlen()
   281378 cycles 4.asm: SSE Intel Atom

total [600 .. 1000], 200++
    67189 cycles 6.asm: AVX512 64
   110356 cycles 5.asm: AVX 32
   218898 cycles 3.asm: SSE Intel Silvermont
   235207 cycles 4.asm: SSE Intel Atom
   272100 cycles 2.asm: SSE 32
   296195 cycles 1.asm: SSE 16
   643732 cycles 0.asm: msvcrt.strlen()

aw27 · November 19, 2019, 08:40:48 PM

This is another suite of test results for 64-bit strlen variations, including Agner Fog, PCMPISTRI and Masm32 SDK.
I added a test for extra long strings in the range 40000 to 40800 bytes.
Masm32 SDK strlen is not SIMD assisted (and I believe msvcrt.strlen is not as well) so plays a different tournament. But it could be made that way because all 64-bit machines have support for SSE.

Code Select


total [0 .. 40], 16++
   135718 cycles 5.asm: AVX 32
   147583 cycles 8.asm: PCMPISTRI
   159476 cycles 1.asm: SSE 16
   169063 cycles 3.asm: SSE Intel Silvermont
   190066 cycles 2.asm: SSE 32
   192212 cycles 7.asm: Agner Fog
   210091 cycles 6.asm: AVX512 64
   238010 cycles 0.asm: msvcrt.strlen()
   280346 cycles 4.asm: SSE Intel Atom
   282475 cycles 9.asm: Masm32 SDK
    
   total [41 .. 80], 16++
   116625 cycles 5.asm: AVX 32
   120875 cycles 6.asm: AVX512 64
   136046 cycles 3.asm: SSE Intel Silvermont
   169359 cycles 8.asm: PCMPISTRI
   179466 cycles 1.asm: SSE 16
   198766 cycles 2.asm: SSE 32
   205015 cycles 7.asm: Agner Fog
   257180 cycles 0.asm: msvcrt.strlen()
   278100 cycles 4.asm: SSE Intel Atom
   487603 cycles 9.asm: Masm32 SDK

   total [600 .. 1000], 200++
    83570 cycles 6.asm: AVX512 64
   110477 cycles 5.asm: AVX 32
   218994 cycles 3.asm: SSE Intel Silvermont
   253867 cycles 4.asm: SSE Intel Atom
   279579 cycles 2.asm: SSE 32
   307387 cycles 1.asm: SSE 16
   334595 cycles 7.asm: Agner Fog
   488680 cycles 8.asm: PCMPISTRI
   621900 cycles 0.asm: msvcrt.strlen()
  1066191 cycles 9.asm: Masm32 SDK

  total [40000 .. 40800], 200++
   505134 cycles 6.asm: AVX512 64
   977509 cycles 5.asm: AVX 32
  1468195 cycles 3.asm: SSE Intel Silvermont
  1684275 cycles 4.asm: SSE Intel Atom
  2241774 cycles 2.asm: SSE 32
  2250641 cycles 1.asm: SSE 16
  2609106 cycles 7.asm: Agner Fog
  3257461 cycles 8.asm: PCMPISTRI
  4818268 cycles 0.asm: msvcrt.strlen()
  8809927 cycles 9.asm: Masm32 SDK

aw27 · November 19, 2019, 11:42:23 PM

In this test I introduce a new AVX-512 strlen variation called Fast AVX512 64

Code Select


.code
	xor rax, rax
	vxorps zmm1, zmm1, zmm1
L1:	
	vpcmpeqb k1, zmm1,zmmword ptr [rcx+rax]
	kmovq  r9,k1
	add rax, 64
	test r9,r9
	jz L1
    bsf             r9,r9
    lea             rax,[rax+r9-64]
    ret
end

These are the results:

Code Select


total [0 .. 40], 16++
    83041 cycles 9.asm: Fast AVX512 64
   118463 cycles 8.asm: PCMPISTRI
   136861 cycles 5.asm: AVX 32
   145743 cycles 3.asm: SSE Intel Silvermont
   163889 cycles 1.asm: SSE 16
   178432 cycles 6.asm: AVX512 64
   185371 cycles 7.asm: Agner Fog
   196856 cycles 2.asm: SSE 32
   228516 cycles 0.asm: msvcrt.strlen()
   277227 cycles 4.asm: SSE Intel Atom
   
total [41 .. 80], 16++
    61027 cycles 9.asm: Fast AVX512 64
   111154 cycles 5.asm: AVX 32
   130256 cycles 6.asm: AVX512 64
   139440 cycles 3.asm: SSE Intel Silvermont
   155091 cycles 8.asm: PCMPISTRI
   183854 cycles 1.asm: SSE 16
   194775 cycles 7.asm: Agner Fog
   212161 cycles 2.asm: SSE 32
   285351 cycles 4.asm: SSE Intel Atom
   311238 cycles 0.asm: msvcrt.strlen()

total [600 .. 1000], 200++
    71159 cycles 9.asm: Fast AVX512 64
    81938 cycles 6.asm: AVX512 64
   110439 cycles 5.asm: AVX 32
   220499 cycles 3.asm: SSE Intel Silvermont
   254703 cycles 4.asm: SSE Intel Atom
   293130 cycles 2.asm: SSE 32
   308233 cycles 1.asm: SSE 16
   338944 cycles 7.asm: Agner Fog
   516498 cycles 8.asm: PCMPISTRI
   648680 cycles 0.asm: msvcrt.strlen()

total [40000 .. 40800], 200++
   390634 cycles 6.asm: AVX512 64
   414175 cycles 9.asm: Fast AVX512 64
   606734 cycles 5.asm: AVX 32
  1392867 cycles 3.asm: SSE Intel Silvermont
  1417887 cycles 4.asm: SSE Intel Atom
  2194951 cycles 1.asm: SSE 16
  2200795 cycles 2.asm: SSE 32
  2229910 cycles 7.asm: Agner Fog
  3295851 cycles 8.asm: PCMPISTRI
  4538755 cycles 0.asm: msvcrt.strlen()

For huge strings the other variation catchs up.

May be it can be improved just a little.

Later:

I have not used it in the above test but the following, which is based on the same idea, is also faster than AVX 32 except for huge strings (but the difference is small).

Code Select


.code
    xor rax, rax
    vxorps ymm0, ymm0, ymm0
L1:	
    vpcmpeqb ymm1, ymm0, ymmword ptr [rcx+rax]
    vpmovmskb r9d,ymm1
    add     rax,32
    test    r9d,r9d
    jz L1
    bsf  r9,r9
    lea  rax,[rax+r9-32]
    ret
    end

nidud · November 20, 2019, 03:53:14 AM

deleted

aw27 · November 20, 2019, 06:34:12 AM

Very good Nidud.

Something I dislike is the .cmd extension and the makefile. I dislike as well the "option dllimport:" that comes back from Jwasm and in general everything that tries to establish unnecessary alternatives for doing things that are well done in the traditional way.

Notepad++ users (probably not many here) can build these ASMC test suits for Debugging using this script:

Let's see if it works:

nidud · November 20, 2019, 07:17:44 AM

deleted

aw27 · November 20, 2019, 10:58:27 PM

I understand, but ASMC libs appear not to be compatible with Microsoft LINK.exe.
We will need to learn how to use linkw.exe, right?
Uasm libs don't force us to use anything else (they have not as well, thanks God).

nidud · November 21, 2019, 06:35:28 AM

deleted

CMalcheski · April 03, 2020, 11:47:46 AM

Quote from: KeepingRealBusy on March 04, 2015, 10:23:39 AM
rrr

Mis-alignment can be a problem, especially if both the source and the destination buffers have some mis-alignment, especialy if they have different alignments. The usual case is that one or the other of the buffers can be aligned mod 16. The way to handle these is to move a block of 16 bytes from the start of the buffer and a block of 16 bytes at the end of the buffer with un-aligned XMM moves. Then calculate the size left to be moved (data size - 32) and round this up to the next mod 16, then start at the first mod 16 byte boundary of the un-aligned buffer and move 16 bytes at a time (there will be a small overlap at either end).

Dave.

Very few people take their efforts so far as to avoid this issue, but internally any data not aligned on its own size (i.e. a dword not aligned on a 4 byte boundary) is going to require 2 reads to retrieve the data. That's a big slowdown. The effort required to properly align data across the board is well worth it.

hutch-- · April 03, 2020, 12:18:35 PM

AW,

Clock this one in your list. Memory must be aligned by the data size.

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

NOSTACKFRAME

ymmcopya proc

; rcx = source address
; rdx = destination address
; r8 = byte count

mov r11, r8
shr r11, 5 ; div by 32 for loop count
xor r10, r10 ; zero r10 to use as index

lpst:
vmovntdqa ymm0, YMMWORD PTR [rcx+r10]
vmovntdq YMMWORD PTR [rdx+r10], ymm0

add r10, 32
sub r11, 1
jnz lpst

mov rax, r8 ; calculate remainder if any
and rax, 31
test rax, rax
jnz @F
ret

@@:
mov r9b, [rcx+r10] ; copy any remainder
mov [rdx+r10], r9b
add r10, 1
sub rax, 1
jnz @B

ret

ymmcopya endp

STACKFRAME

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

aw27 · April 03, 2020, 08:57:49 PM

I can't test now but non-temporal instructions tend to perform poorly in modern processors. That's the main reason Agner Fog's tests did not shine.
I will test when possible or you can do it directly from nidud's example.

guga · August 17, 2023, 04:21:21 AM

Old thread, but can someone test this update of StrLen ? I tried to port it to masm to be used in masmbasic, but coulnd´t do it

This is an old routine, JJ and i did a long time ago, but i updated it, since it has some small flaws when retrieving the lenght of unaligned strings and also strings that may contains extra bytes at the end (immediatelly after the null terminated byte)

Now, it seems to handle all situations, but i´m not sure about the speed

RosAsm syntax:

Code Select

Proc StrLen3:
    Arguments @Input
    Uses ecx

    xorps xmm0 xmm0
    mov ecx D@Input

    L0:
         movups xmm1 X$ecx | pcmpeqb xmm0 xmm1 | pmovmskb eax xmm0
         add ecx 16 ; point to the next 16 bytes of our string
     test ax ax | jz L0< ; Did we found any zeroes in eax ? No, jmp back and do it again. Otherwise, go to the next line
    sub ecx D@Input
    add ecx 0-16
    bsf ax ax
    add eax ecx

EndP

Masm syntax

Code Select

StrLen3         proc near               ; CODE XREF: start+13↑p

Input           = dword ptr  8

                push    ebp
                mov     ebp, esp
                push    ecx
                xorps   xmm0, xmm0
                mov     ecx, [ebp+Input]

ZeroNotFound:
                movups  xmm1, xmmword ptr [ecx] ; ? Is this syntax correct ? or is only movups  xmm1, [ecx] ?
                pcmpeqb xmm0, xmm1
                pmovmskb eax, xmm0
                add     ecx, 10h
                test    ax, ax
                jz      short ZeroNotFound
                sub     ecx, [ebp+Input]
                add     ecx, -16
                bsf     ax, ax
                add     eax, ecx
                pop     ecx
                mov     esp, ebp
                pop     ebp
                retn    4
StrLen3         endp

NoCforMe · August 17, 2023, 04:52:04 AM

Quote from: guga on August 17, 2023, 04:21:21 AM[...] and also strings that may contains extra bytes at the end (immediatelly after the null terminated byte)

Since when should this be a concern? Anything after the terminating NULL should just be ignored, right?

guga · August 17, 2023, 05:20:28 AM

Not really. I was giving a new test and found that the older routine has some flaws sometimes when the string is between other chain of bytes or is unaligned. This was happenning for the strlen and strlenw versions.

The problem was when the end of the string (after the null terminated byte) contains any extra data (also if the start of the string was unaglined), it was bypassing the null terminated and thus giving a wrong result.

This issues are fixed on the newer version (this one i posted and the other i´ll post latter about strlenw), but i wanna make sure if the speed was not affected.

The former error happened more offently when we were dealing with non-unicode strings (strlenW version). I mean, strings in chinese, russian etc etc.

Some examples was things like this in the unicode version:

Code Select

[UnicodeString5: B$ 1, 7, 13, 58, 12, 36, 77, 158, 18,
;U$ "Hello", 0
StringStart3: W$ 04f60, 0597d, 0e, 012, 0ff00, 04f60, 0597d, 0e, 012, 0ff00, 04f60, 0597d, 0e, 012, 0ff00, 04f60, 0597d, 0e, 012, 0ff00, 0

NextData3: W$ 7, 1, 1, 1, 1, 1, 1, 1]

[UnicodeString: U$ "Hello", 0]

[DummyBytes2: B$ 35, 47, 5, 5, 14, 5]
[< UnicodeStringUnaligned: B$ 1,
W$ 04f60, 0597d, 056fe, 06211, 04e00, 0
B$ 37]

And the former versions (That produced incorrect results in some circuinstances were like this:
unicode version

Code Select

Proc StrLenW::
    Arguments @Input
    Uses ecx, edx

    mov eax D@Input
    mov ecx eax
    and eax 0-16
    and ecx 16-1
    or edx 0-1
    shl edx cl
    xorps xmm0 xmm0
    pcmpeqw xmm0 X$eax
    add eax 16
    pmovmskb ecx xmm0
    xorps xmm0 xmm0
    and ecx edx
    jnz L2>
L1:
    movups xmm1 X$eax
    pcmpeqw xmm1 xmm0
    pmovmskb ecx xmm1
    add eax 16
    test ecx ecx
    jz L1<
L2:
    bsf ecx ecx
    lea eax D$eax+ecx-16
    sub eax D@Input
    shr eax 1

EndP

Ansi version

Code Select

Proc StrLen::
    Arguments @Input
    Uses ecx, edx

    mov eax D@Input
    mov ecx eax
    and eax 0-16
    and ecx 16-1
    or edx 0-1
    shl edx cl
    xorps xmm0 xmm0
    pcmpeqb xmm0 X$eax
    add eax 16
    pmovmskb ecx xmm0
    xorps xmm0 xmm0
    and ecx edx
    jnz L2>
L1:
    movups xmm1 X$eax
    pcmpeqb xmm1 xmm0
    pmovmskb ecx xmm1
    add eax 16
    test ecx ecx
    jz L1<
L2:
    bsf ecx ecx
    lea eax D$eax+ecx-16
    sub eax D@Input

EndP

With the new fixes, there should be no more any uknown error that results in a wrong value. Therefore, it works now for unaligned strings and also strings that are inside chains of data, and also in whatever CodePage (in case we are using the strlenW version)

jj2007 · August 17, 2023, 06:55:55 AM

Quote from: guga on August 17, 2023, 04:21:21 AMit has some small flaws when retrieving the lenght of unaligned strings and also strings that may contains extra bytes at the end (immediatelly after the null terminated byte)

Hi Guga,

I'm not aware of such problems, can you post an example of a string that misbehaves, in Masm syntax?

There is a known issue with VirtualAlloc'ed strings.

The MASM Forum

News:

Faster Memcopy ...

aw27

aw27

aw27

nidud

aw27

nidud

aw27

nidud

CMalcheski

hutch--

aw27

guga

NoCforMe

guga

jj2007