Author Topic: Faster Memcopy ...  (Read 32980 times)

AW

  • Member
  • *****
  • Posts: 2442
  • Let's Make ASM Great Again!
Re: Faster Memcopy ...
« Reply #135 on: November 19, 2019, 07:17:29 PM »
Added the Intel versions as well. They were originally 32-bit, so a native 64-bit version would have been coded differently from the direct translation done here.

Lots of register space there now: 32 regs * 64 bytes = 2048 bytes.

I think there is a bug in the ANSI version here:
args_x macro
    lea rcx,str_1[size_s]
    mov eax,step_x
    add eax,eax  ; <---------- HERE: doubles step_x
    sub rcx,rax
    exitm<>
    endm

Although it does not cause a buffer overflow, it clearly changes the results.
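The `add eax,eax` doubles step_x, which matches a 2-byte-per-character Unicode buffer but not an ANSI one. A hedged C sketch of the two offset computations (function names are mine, not from the test harness):

```c
#include <assert.h>
#include <stddef.h>

/* Start address of a string step_x characters before the end of a
   size_s byte buffer; hypothetical names mirroring the args_x macro. */
static const char *ansi_start(const char *buf, size_t size_s, size_t step_x) {
    return buf + size_s - step_x;       /* ANSI: 1 byte per character */
}

static const char *wide_start(const char *buf, size_t size_s, size_t step_x) {
    return buf + size_s - 2 * step_x;   /* Unicode: 2 bytes per character */
}
```

With the stray doubling, the ANSI test started 2*step_x bytes before the end of the buffer, so it effectively timed strings about twice the intended length.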

These are the results after removing the "add eax, eax"

Code: [Select]
total [0 .. 40], 16++
   133475 cycles 5.asm: AVX 32
   152210 cycles 3.asm: SSE Intel Silvermont
   157375 cycles 1.asm: SSE 16
   172910 cycles 6.asm: AVX512 64
   178312 cycles 2.asm: SSE 32
   228359 cycles 0.asm: msvcrt.strlen()
   326672 cycles 4.asm: SSE Intel Atom
   
total [41 .. 80], 16++
   117358 cycles 5.asm: AVX 32
   123539 cycles 6.asm: AVX512 64
   165831 cycles 1.asm: SSE 16
   169369 cycles 3.asm: SSE Intel Silvermont
   210518 cycles 2.asm: SSE 32
   270514 cycles 0.asm: msvcrt.strlen()
   281378 cycles 4.asm: SSE Intel Atom

total [600 .. 1000], 200++
    67189 cycles 6.asm: AVX512 64
   110356 cycles 5.asm: AVX 32
   218898 cycles 3.asm: SSE Intel Silvermont
   235207 cycles 4.asm: SSE Intel Atom
   272100 cycles 2.asm: SSE 32
   296195 cycles 1.asm: SSE 16
   643732 cycles 0.asm: msvcrt.strlen()

AW

Re: Faster Memcopy ...
« Reply #136 on: November 19, 2019, 08:40:48 PM »
This is another suite of test results for 64-bit strlen variations, including Agner Fog, PCMPISTRI and the Masm32 SDK.
I added a test for extra-long strings in the range 40000 to 40800 bytes.
The Masm32 SDK strlen is not SIMD-assisted (and I believe msvcrt.strlen is not either), so it plays a different tournament. But it could be made that way, because all 64-bit machines have SSE support.
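For contrast, a strlen without SIMD assistance is just a byte-at-a-time loop, which is why it plays a different tournament; a minimal C sketch:

```c
#include <assert.h>
#include <stddef.h>

/* Plain byte-at-a-time strlen: one load, one compare, one increment
   per character, with no way to examine 16/32/64 bytes per step. */
static size_t scalar_strlen(const char *s) {
    const char *p = s;
    while (*p)
        p++;
    return (size_t)(p - s);
}
```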

Code: [Select]
total [0 .. 40], 16++
   135718 cycles 5.asm: AVX 32
   147583 cycles 8.asm: PCMPISTRI
   159476 cycles 1.asm: SSE 16
   169063 cycles 3.asm: SSE Intel Silvermont
   190066 cycles 2.asm: SSE 32
   192212 cycles 7.asm: Agner Fog
   210091 cycles 6.asm: AVX512 64
   238010 cycles 0.asm: msvcrt.strlen()
   280346 cycles 4.asm: SSE Intel Atom
   282475 cycles 9.asm: Masm32 SDK
   
total [41 .. 80], 16++
   116625 cycles 5.asm: AVX 32
   120875 cycles 6.asm: AVX512 64
   136046 cycles 3.asm: SSE Intel Silvermont
   169359 cycles 8.asm: PCMPISTRI
   179466 cycles 1.asm: SSE 16
   198766 cycles 2.asm: SSE 32
   205015 cycles 7.asm: Agner Fog
   257180 cycles 0.asm: msvcrt.strlen()
   278100 cycles 4.asm: SSE Intel Atom
   487603 cycles 9.asm: Masm32 SDK

total [600 .. 1000], 200++
    83570 cycles 6.asm: AVX512 64
   110477 cycles 5.asm: AVX 32
   218994 cycles 3.asm: SSE Intel Silvermont
   253867 cycles 4.asm: SSE Intel Atom
   279579 cycles 2.asm: SSE 32
   307387 cycles 1.asm: SSE 16
   334595 cycles 7.asm: Agner Fog
   488680 cycles 8.asm: PCMPISTRI
   621900 cycles 0.asm: msvcrt.strlen()
  1066191 cycles 9.asm: Masm32 SDK

total [40000 .. 40800], 200++
   505134 cycles 6.asm: AVX512 64
   977509 cycles 5.asm: AVX 32
  1468195 cycles 3.asm: SSE Intel Silvermont
  1684275 cycles 4.asm: SSE Intel Atom
  2241774 cycles 2.asm: SSE 32
  2250641 cycles 1.asm: SSE 16
  2609106 cycles 7.asm: Agner Fog
  3257461 cycles 8.asm: PCMPISTRI
  4818268 cycles 0.asm: msvcrt.strlen()
  8809927 cycles 9.asm: Masm32 SDK
 

AW

Re: Faster Memcopy ...
« Reply #137 on: November 19, 2019, 11:42:23 PM »
In this test I introduce a new AVX-512 strlen variation called Fast AVX512 64  :biggrin:

Code: [Select]
.code
    xor rax, rax                             ; running byte offset
    vxorps zmm1, zmm1, zmm1                  ; zmm1 = 64 zero bytes
L1:
    vpcmpeqb k1, zmm1, zmmword ptr [rcx+rax] ; compare 64 bytes against 0
    kmovq r9, k1                             ; mask of zero-byte positions
    add rax, 64
    test r9, r9
    jz L1                                    ; no terminator in this block
    bsf rax, r9                              ; wrong: see below
    bsf r9, r9                               ; index of first zero byte in block
    lea rax, [rax+r9-64]                     ; length = block base + index
    ret                                      ; note: may read up to 63 bytes past the terminator
end

These are the results:

Code: [Select]
total [0 .. 40], 16++
    83041 cycles 9.asm: Fast AVX512 64
   118463 cycles 8.asm: PCMPISTRI
   136861 cycles 5.asm: AVX 32
   145743 cycles 3.asm: SSE Intel Silvermont
   163889 cycles 1.asm: SSE 16
   178432 cycles 6.asm: AVX512 64
   185371 cycles 7.asm: Agner Fog
   196856 cycles 2.asm: SSE 32
   228516 cycles 0.asm: msvcrt.strlen()
   277227 cycles 4.asm: SSE Intel Atom
   
total [41 .. 80], 16++
    61027 cycles 9.asm: Fast AVX512 64
   111154 cycles 5.asm: AVX 32
   130256 cycles 6.asm: AVX512 64
   139440 cycles 3.asm: SSE Intel Silvermont
   155091 cycles 8.asm: PCMPISTRI
   183854 cycles 1.asm: SSE 16
   194775 cycles 7.asm: Agner Fog
   212161 cycles 2.asm: SSE 32
   285351 cycles 4.asm: SSE Intel Atom
   311238 cycles 0.asm: msvcrt.strlen()

total [600 .. 1000], 200++
    71159 cycles 9.asm: Fast AVX512 64
    81938 cycles 6.asm: AVX512 64
   110439 cycles 5.asm: AVX 32
   220499 cycles 3.asm: SSE Intel Silvermont
   254703 cycles 4.asm: SSE Intel Atom
   293130 cycles 2.asm: SSE 32
   308233 cycles 1.asm: SSE 16
   338944 cycles 7.asm: Agner Fog
   516498 cycles 8.asm: PCMPISTRI
   648680 cycles 0.asm: msvcrt.strlen()

total [40000 .. 40800], 200++
   390634 cycles 6.asm: AVX512 64
   414175 cycles 9.asm: Fast AVX512 64
   606734 cycles 5.asm: AVX 32
  1392867 cycles 3.asm: SSE Intel Silvermont
  1417887 cycles 4.asm: SSE Intel Atom
  2194951 cycles 1.asm: SSE 16
  2200795 cycles 2.asm: SSE 32
  2229910 cycles 7.asm: Agner Fog
  3295851 cycles 8.asm: PCMPISTRI
  4538755 cycles 0.asm: msvcrt.strlen()   

For huge strings the other variation catches up.  :badgrin:
Maybe it can be improved a little.


Later:

I have not used it in the above test, but the following, based on the same idea, is also faster than AVX 32 except for huge strings (and there the difference is small).

Code: [Select]
.code
    xor rax, rax                               ; running byte offset
    vxorps ymm0, ymm0, ymm0                    ; ymm0 = 32 zero bytes
L1:
    vpcmpeqb ymm1, ymm0, ymmword ptr [rcx+rax] ; compare 32 bytes against 0
    vpmovmskb r9d, ymm1                        ; mask of zero-byte positions
    add     rax, 32
    test    r9d, r9d
    jz L1                                      ; no terminator in this block
    bsf  r9, r9                                ; index of first zero byte in block
    lea  rax, [rax+r9-32]                      ; length = block base + index
    ret
    end
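The same block-scan idea — compare a whole block against zero, then use the bit position of the first zero byte — can be sketched in portable C with 8-byte words (a SWAR approach in the spirit of glibc's scalar strlen, not a translation of the code above):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Portable sketch of the block-scan strlen idea: scan 8 bytes at a
   time, flag zero bytes with the SWAR trick, then locate the first
   flag. Assumes the buffer is zero-padded to a multiple of 8 bytes,
   the SWAR analogue of the assembly's read-past-the-terminator. */
static size_t block_strlen(const char *s) {
    size_t off = 0;
    for (;;) {
        uint64_t v;
        memcpy(&v, s + off, sizeof v);              /* load 8 bytes */
        uint64_t zeros = (v - 0x0101010101010101ULL)
                       & ~v & 0x8080808080808080ULL; /* 0x80 where byte == 0 */
        if (zeros) {
            size_t i = 0;                            /* first flagged byte */
            while (!(zeros & 0xFF)) { zeros >>= 8; i++; }
            return off + i;                          /* block base + index */
        }
        off += 8;
    }
}
```

The `if (zeros)` test plays the role of `test r9,r9 / jz L1`, and the byte-shift loop stands in for `bsf`.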
« Last Edit: November 20, 2019, 01:21:23 AM by AW »

nidud

  • Member
  • *****
  • Posts: 1800
    • https://github.com/nidud/asmc
Re: Faster Memcopy ...
« Reply #138 on: November 20, 2019, 03:53:14 AM »
I think there is a bug in the ANSI version here:
args_x macro
    lea rcx,str_1[size_s]
    mov eax,step_x
    add eax,eax  <---------- HERE
    sub rcx,rax
    exitm<>
    endm

Yes, it was copied from the Unicode test.
I updated the timeit.inc file for 64-bit to remove the 0..9 limit.

You may now continue from 9.asm on to a.asm .. z.asm.
The size of the info array is now equal to the largest id used.

procs equ <for x,<0,1>> ; add functions to test...

    .data
    info_0 db "0",0 ; only used ones needed...
    info_1 db "1",0

The id-array is still numeric, but files and info use chars:

procs equ <for x,<0,10>> ; add functions to test...

    .data
    info_0 db "0",0
    info_1 db "1",0
    ...
    info_9 db "9",0
    info_a db "10",0


AW

Re: Faster Memcopy ...
« Reply #139 on: November 20, 2019, 06:34:12 AM »
Very good, Nidud.

Something I dislike is the .cmd extension and the makefile. I also dislike the "option dllimport:" carried over from JWasm, and in general everything that tries to establish unnecessary alternatives for things that are already well done the traditional way.


Notepad++ users (probably not many here) can build these ASMC test suites for debugging using this script:



Let's see if it works:




nidud

Re: Faster Memcopy ...
« Reply #140 on: November 20, 2019, 07:17:44 AM »
Something I dislike is the .cmd extension and the makefile. I dislike as well the "option dllimport:" that comes back from Jwasm and in general everything that tries to establish unnecessary alternatives for doing things that are well done in the traditional way.

The main include files use:
.pragma comment(lib, libc, msvcrt)

This will use option dllimport:<msvcrt> if the -pe switch is used, and includelib libc.lib if not.

timeit.inc:

    .pragma comment(lib, msvcrt)
    printf  proto :ptr byte, :vararg
    exit    proto :dword
    _getch  proto

    .pragma comment(lib, kernel32)
    GetCurrentProcess proto

AW

Re: Faster Memcopy ...
« Reply #141 on: November 20, 2019, 10:58:27 PM »
I understand, but ASMC libs appear not to be compatible with Microsoft LINK.exe.
We will need to learn how to use linkw.exe, right?
Uasm libs don't force us to use anything else (thank God).

nidud

Re: Faster Memcopy ...
« Reply #142 on: November 21, 2019, 06:35:28 AM »
I understand, but ASMC libs appear not to be compatible with Microsoft LINK.exe.

They are all compatible with LINK, except for the 64-bit import libraries created by LIBW (which needs an update). It has to be this way; otherwise you would be forced to download a few gigs of corporate tools in order to build them, or do as Hutch does and use POLIB for this purpose.

There are many samples in the source directory showing how to build def files, include files, and import libraries from installed dll files. To update the import libraries using LIB (or POLIB), you may apply the following changes to the import.asm file and rebuild:

    fprintf(r12, "LIBRARY %s\nEXPORTS\n", rdi)
    .while r13d
        lodsd
        ;fprintf(r12, "++%s.\'%s.dll\'\n", addr [rax+rbx], rdi)
        fprintf(r12, "%s\n", &[rax+rbx])
        dec r13d
    .endw
    fclose(r12)
    lea rbx,buffer
    ;sprintf(rbx, "libw /n /c /q /b /fac /i6 %s\\%s.lib @%s.def", path, rdi, rdi)
    sprintf(rbx, "lib /machine:x64 /def:%s.def /out:%s\\%s.lib", rdi, path, rdi)

Note: the PATH needs to be set for LIB in the above sample.
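The loop above just emits a module-definition file and then builds the librarian command line; a hedged C sketch of the same output (function names are mine, not from import.asm):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Write a LIBRARY/EXPORTS module-definition file for a DLL,
   mirroring the fprintf loop in import.asm. */
static void write_def(FILE *f, const char *dll, const char **names, int n) {
    fprintf(f, "LIBRARY %s\nEXPORTS\n", dll);
    for (int i = 0; i < n; i++)
        fprintf(f, "%s\n", names[i]);
}

/* Build the librarian command line, as in the sprintf call above. */
static void lib_cmd(char *buf, size_t sz, const char *path, const char *name) {
    snprintf(buf, sz, "lib /machine:x64 /def:%s.def /out:%s\\%s.lib",
             name, path, name);
}
```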

Quote
We will need to learn how to use linkw.exe, right?

As for the benchmark test, it uses neither include files nor a linker; but in general terms, no. The LIBC startup and auto-install is a bit complicated and incorporated into the tools, so this may differ depending on the version and tool-chain. In the Asmc LIBC.LIB mainCRTStartup is included, but it is not in the current msvcrt.dll (Win7-64), so building a 64-bit debug version using LINK:

include stdio.inc

    .code

main proc

    printf("debug: Win64 console application\n")
    xor eax,eax
    ret

main endp

    end

set LIB=\asmc\lib\amd64
asmc64 -Zi test.asm
link /map /debug /MACHINE:X64 /subsystem:console test.obj

To build without LIBC (use msvcrt.dll):
...
include tchar.inc
...
asmc64 -D__PE__ -Zi test.asm

If you want to debug LIBC add the full path of the source to the makefile:

AFLAGS = -Zi -Zp8 -D_CRTBLD -Cs -I$(inc_dir)

$(lib_dir)\amd64\libc.lib:
    asmc64 $(AFLAGS) /r G:\asmc\source\lib64\*.asm