memcpy_SSE

guga · May 30, 2013, 03:33:54 PM

This algorithm is suitable for copying large memory blocks, aligned or not. It can also be used to copy long strings when you already know their length.

; Fast memory copy. Can also be used as a fast string copy if you know the string length.
; pDest = destination buffer for the copied memory
; pSource = the input buffer
; Length = length in bytes of the input buffer (pSource).
; It has the same functionality as memcpy from msvcrt.dll and is intended to be faster.
; It works by copying 128 bits (4 dwords) at a time.

Proc memcpy_SSE:
    Arguments @pDest, @pSource, @Length
    Uses esi, edi, ecx, edx, eax

    mov edi D@pDest
    mov esi D@pSource
    ; we copy the memory 128 bits (16 bytes) at a time
    mov ecx D@Length
    mov eax ecx | shr ecx 4 ; block count: Length / 16 (one block = 4 dwords)
    jz L0> ; The memory size is smaller than 16 bytes. Jump over

        ; Now compute the remainder, i.e. how many bytes the final byte loop must copy
        mov edx ecx | shl edx 4 | sub eax edx ; remainder. Can be 0 to 15
        mov edx 0 ; from here on edx is used as an index
        L1:
            movupd XMM1 X$esi+edx*8 ; load the current 4 dwords (16 bytes) from the source into XMM1
            movupd X$edi+edx*8 XMM1 ; store those 16 bytes from XMM1 to the destination
            lea edx D$edx+2 ; each block is 128 bits, so the index advances by 2 rather than 1: one index unit is only 8 bytes (the largest scale factor in an address is 8, as in edx*8)
                  ; So, when edx = 0, edx*8 = 0 and X$esi+edx*8 points to esi+0 bytes
                  ; when edx = 2, edx*8 = 16 and X$esi+edx*8 points to esi+16 bytes
                  ; when edx = 4, edx*8 = 32 and X$esi+edx*8 points to esi+32 bytes
                  ; What matters is that after each iteration the source and destination addresses are 16 bytes further on.
            dec ecx   ; ecx is our counter, Length/16. Why 16? Because each iteration moves 4 dwords, so the loop needs 16 times fewer iterations than a byte-by-byte copy.
            jnz L1<
        emms ; reset the FPU tag word so the FPU can be used again (only required after MMX code; XMM registers do not alias the FPU stack)
        shl edx 3 ; multiply edx by 8 to get the byte offset
        add edi edx
        add esi edx
        jmp L2> ; jmp to the remainder computation

L0:
   ; If we are here, the data is smaller than 16 bytes and only the remainder has to be copied.
   mov edx ecx | shl edx 4 | sub eax edx ; remainder. Can be 0 to 15

L2:

    ; If the length is not a multiple of 16 bytes (4 dwords), some bytes remain. Copy them one by one.
    While eax <> 0
        movsb
        dec eax
    End_While

EndP
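
For readers who do not use RosAsm syntax, the routine above can be modelled in C. This is only a sketch of the same idea (unaligned 16-byte copies, then a byte loop for the 0 to 15 leftover bytes), not a translation that guga posted. The name memcpy_sse_sketch, the memcpy-style return value and the fallback for compilers without SSE2 are assumptions added here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics: _mm_loadu_si128, _mm_storeu_si128 */
#endif

/* Hypothetical C model of the block/remainder split used by memcpy_SSE. */
void *memcpy_sse_sketch(void *dest, const void *src, size_t length)
{
    unsigned char *d = (unsigned char *)dest;
    const unsigned char *s = (const unsigned char *)src;
    size_t blocks = length >> 4;   /* number of 16-byte blocks, like "shr ecx 4" */
    size_t rest = length & 15;     /* leftover bytes, always 0..15 */

    assert(length == 0 || (dest != NULL && src != NULL));

    while (blocks--) {
#if defined(__SSE2__)
        /* Unaligned 128-bit load and store, the role movupd plays in the asm. */
        __m128i x = _mm_loadu_si128((const __m128i *)s);
        _mm_storeu_si128((__m128i *)d, x);
#else
        memcpy(d, s, 16);          /* assumption: plain fallback without SSE2 */
#endif
        s += 16;
        d += 16;
    }

    while (rest--)                 /* byte loop, the role of movsb in the asm */
        *d++ = *s++;

    return dest;
}
```

The `length >> 4` and `length & 15` pair corresponds to the `shr ecx 4` and `sub eax edx` steps in the assembly. Like memcpy, the sketch assumes that source and destination do not overlap.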

Example of usage:

[OutputBuffer: B$ 0 #2048]

    mov esi {B$ "Hello, my name is g works as expected, since i´ tryoing to give a update of here. Hello, my name is guga, i´m 41 years old. Brazilian. I am testing this 128 bit operation to see if it works ok ? I hope works as expected, since i´ tryoing to give a update of here. Hello, my name is guga, i´m 41 years old. Brazilian. I am testing this 128 bit operation to see if it works ok ? I hope works as expected, since i´ tryoing to     ", 0};D@pSource
    c_call 'msvcrt.strlen' esi

    call memcpy_SSE OutputBuffer, esi, eax

jj2007 · May 30, 2013, 04:41:46 PM

Looks good - for some alignments it is better than the CRT.
Here are some timings.

AMD Athlon(tm) Dual Core Processor 4450B (SSE3)

Algo           memcpy   MemCo1   MemCo2  MemCoC3  MemCoP4  MemCoC2   MemCoL  xmemcpy memcpy_S
Description       CRT rep movs   movdqa  lps+hps   movdqa   movdqa   Masm32   Habran     Guga
                       dest-al    psllq CeleronM  dest-al   src-al  library
Code size           ?       70      291      222      200      269       33      104       84
---------------------------------------------------------------------------------------------
2048, d0s0-0      552      555      359      424      424      359      547      541      558
2048, d1s1-0      734      597      410      473      473      410     1060      798      815
2048, d7s7-0      737      598      422      474      474      411     1059      798      815
2048, d7s8-1      809      867     1016      563      566      567      802      543      560
2048, d7s9-2      809      853     1016      563      566      567     1058      798      801
2048, d8s7+1      819      847      871      563      564      565      804      606      631
2048, d8s8-0      720      602      404      465      465      402      547      559      544
2048, d8s9-1      802      847     1009      563      565      568      806      607      631
2048, d9s7+2      808      855      861      565      564      565     1060      798      801
2048, d9s8+1      823      852      862      563      576      565      803      543      546
2048, d9s9-0      721      594      411      472      486      409     1060      798      801
2048, d15s15      727      591      425      470      471      408     1059      798      802

Legend:
2048   bytes copied
d7s7-0   dest7, src7: both are 7 bytes above a 16-byte aligned dest/src
d7s9-2   dest7, src9: dest is 7 bytes above, src 9 bytes above a 16-byte alignment; diff src-dest = 2 bytes misalignment
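
How such a label relates to the two pointers can be sketched in C. The struct and function names below are made up for illustration and are not taken from the test program. The sign convention (dest offset minus src offset) is inferred from the legend and the row labels, so treat it as an assumption.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical description of one benchmark row such as "d7s9-2". */
struct copy_case {
    unsigned dest_off;  /* bytes above a 16-byte boundary, 0..15 */
    unsigned src_off;   /* same for the source pointer */
    int diff;           /* dest_off - src_off (assumed sign convention) */
};

/* Distance of a pointer from the previous 16-byte boundary. */
static unsigned misalign16(const void *p)
{
    return (unsigned)((uintptr_t)p & 15u);
}

struct copy_case describe_case(const void *dest, const void *src)
{
    struct copy_case c;
    assert(dest != NULL && src != NULL);
    c.dest_off = misalign16(dest);
    c.src_off = misalign16(src);
    c.diff = (int)c.dest_off - (int)c.src_off;
    return c;
}

/* Prints the label in the style used by the table, e.g. d7s9-2 or d8s7+1. */
void print_case(const struct copy_case *c)
{
    printf("d%us%u%c%d\n", c->dest_off, c->src_off,
           c->diff > 0 ? '+' : '-',
           c->diff < 0 ? -c->diff : c->diff);
}
```

A destination 7 bytes and a source 9 bytes past a 16-byte boundary give dest_off = 7, src_off = 9 and diff = -2. When diff is 0 the two pointers are misaligned by the same amount, which is the case where a copy routine can align both at once.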

Gunther · May 30, 2013, 07:07:13 PM

Jochen,

Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
 
Algo           memcpy   MemCo1   MemCo2  MemCoC3  MemCoP4  MemCoC2   MemCoL  xmemcpy memcpy_S 
Description       CRT rep movs   movdqa  lps+hps   movdqa   movdqa   Masm32   Habran     Guga 
                       dest-al    psllq CeleronM  dest-al   src-al  library                   
Code size           ?       70      291      222      200      269       33      104       84 
--------------------------------------------------------------------------------------------- 
2048, d0s0-0      361      435      246      248      247      249      223      291      282 
2048, d1s1-0      274      251      275      272      276      272      281      305      291 
2048, d7s7-0      276      254      278      274      279      275      281      306      290 
2048, d7s8-1      287      286      617      452      260      274      281      306      291 
2048, d7s9-2      286      287      617      452      261      274      281      306      291 
2048, d8s7+1      280      290      622      481      262      277      283      306      291 
2048, d8s8-0      275      255      295      284      285      292      282      306      290 
2048, d8s9-1      280      278      612      455      267      280      287      311      292 
2048, d9s7+2      290      294      613      487      267      280      287      311      296 
2048, d9s8+1      291      296      612      487      268      280      287      311      296 
2048, d9s9-0      278      261      283      281      280      286      287      311      296 
2048, d15s15      277      218      285      281      284      288      287      310      295

--- ok ---

thank you for your distribution.

Gunther

dedndave · May 31, 2013, 12:19:21 AM

Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)

Algo           memcpy   MemCo1   MemCo2  MemCoC3  MemCoP4  MemCoC2  MemCoL   xmemcpy memcpy_S
Description       CRT rep movs   movdqa  lps+hps   movdqa   movdqa   Masm32   Habran     Guga
                       dest-al    psllq CeleronM  dest-al   src-al  library

Code size           ?       70      291      222      200      269       33      104       84
---------------------------------------------------------------------------------------------
2048, d0s0-0      737      730      618      616      616      615      725     1589     1603
2048, d1s1-0     1103      833      641      642      648      646     4394     3935     3923
2048, d7s7-0     1007      847      649      647      651      645     4400     3927     3921
2048, d7s8-1     1372     1452     1220      870      617      619     4322     3800     3788
2048, d7s9-2     1371     1450     1212      870      619      621     4437     3925     3929
2048, d8s7+1     1360     1443     1193     1311      618     1042     1360     1774     1780
2048, d8s8-0      980      855      655      655      659      647      981     1592     1596
2048, d8s9-1     1353     1468     1208      870      615      618     1356     1761     1768
2048, d9s7+2     1681     1445     1207     1320      617     1041     4158     4105     4137
2048, d9s8+1     1681     1449     1182     1311      620     1033     4025     4012     4029
2048, d9s9-0     1104      837      666      665      666      664     4154     4100     4125
2048, d15s15      767      843      657      657      658      653     4143     4117     4135

MichaelW · May 31, 2013, 02:52:15 AM

Running on my P4 Northwood I get an exception:
c000001eh = STATUS_INVALID_LOCK_SEQUENCE

Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE2)

Algo           memcpy   MemCo1   MemCo2  MemCoC3  MemCoP4  MemCoC2   MemCoL  xmemcpy memcpy_S
Description       CRT rep movs   movdqa  lps+hps   movdqa   movdqa   Masm32   Habran     Guga
                       dest-al    psllq CeleronM  dest-al   src-al  library

Code size           ?       70      291      222      200      269       33      104       84
---------------------------------------------------------------------------------------------
2048, d0s0-0      616      629      493      494      492      489      632     1322     1333
2048, d1s1-0     1092      734      537      541      547      543     3395     3225     3301
2048, d7s7-0      881      734      551      553      567      554     3408     3224     3312
2048, d7s8-1     1964     1662

Microsoft (R) DrWtsn32
Copyright (C) 1985-2001 Microsoft Corp. All rights reserved.

Application exception occurred:
        App: \\P3\MemCopySSE2\MemCopySSE2.exe (pid=3728)
        When: 5/30/2013 @ 11:33:52.343
        Exception number: c000001e 
()

*----> System Information <----*
        Computer Name: DELL
        User Name: User
        Terminal Session Id: 0
        Number of Processors: 2
        Processor Type: x86 Family 15 Model 2 Stepping 9
        Windows Version: 5.1
        Current Build: 2600
        Service Pack: 3
        Current Type: Multiprocessor Free
        Registered Organization: 
        Registered Owner: User

*----> Task List <----*
   0 System Process
   4 System
 564 smss.exe
 620 csrss.exe
 644 winlogon.exe
 688 services.exe
 700 lsass.exe
 876 svchost.exe
 944 svchost.exe
1040 MsMpEng.exe
1076 svchost.exe
1180 svchost.exe
1252 svchost.exe
1476 spoolsv.exe
1696 svchost.exe
1744 cisvc.exe
1784 nvsvc32.exe
 544 alg.exe
1716 Explorer.EXE
 500 smax4pnp.exe
 852 BCMSMMSG.exe
 240 point32.exe
1144 msseces.exe
1504 ctfmon.exe
3472 cidaemon.exe
3728 MemCopySSE2.exe
3976 drwtsn32.exe

*----> Module List <----*
(0000000000400000 - 000000000040f000: \\P3\MemCopySSE2\MemCopySSE2.exe
(0000000076390000 - 00000000763ad000: C:\WINDOWS\system32\IMM32.DLL
(0000000077b40000 - 0000000077b62000: C:\WINDOWS\system32\Apphelp.dll
(0000000077c00000 - 0000000077c08000: C:\WINDOWS\system32\VERSION.dll
(0000000077c10000 - 0000000077c68000: C:\WINDOWS\system32\msvcrt.dll
(0000000077dd0000 - 0000000077e6b000: C:\WINDOWS\system32\ADVAPI32.dll
(0000000077e70000 - 0000000077f03000: C:\WINDOWS\system32\RPCRT4.dll
(0000000077f10000 - 0000000077f59000: C:\WINDOWS\system32\GDI32.dll
(0000000077f60000 - 0000000077fd6000: C:\WINDOWS\system32\SHLWAPI.dll
(0000000077fe0000 - 0000000077ff1000: C:\WINDOWS\system32\Secur32.dll
(000000007c800000 - 000000007c8f6000: C:\WINDOWS\system32\kernel32.dll
(000000007c900000 - 000000007c9b2000: C:\WINDOWS\system32\ntdll.dll
(000000007e410000 - 000000007e4a1000: C:\WINDOWS\system32\user32.dll

*----> State Dump for Thread Id 0xa44 <----*

eax=00000001 ebx=756e6547 ecx=00000080 edx=00000038 esi=0040dca0 edi=0040e4e0
eip=0040ad38 esp=0012ff98 ebp=0012fff0 iopl=0         nv up ei ng nz ac po cy
cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00000297

*** WARNING: Unable to verify checksum for \\P3\MemCopySSE2\MemCopySSE2.exe
*** ERROR: Module load completed but symbols could not be loaded for \\P3\MemCopySSE2\MemCopySSE2.exe
function: MemCopySSE2
        0040ad1b cb               retf
        0040ad1c 660febca         por     xmm1,dx
        0040ad20 660f7f0f         movdqa  oword ptr [edi],xmm1
        0040ad24 8d7610           lea     esi,[esi+0x10]
        0040ad27 49               dec     ecx
        0040ad28 8d7f10           lea     edi,[edi+0x10]
        0040ad2b 75d9             jnz     MemCopySSE2+0xad06 (0040ad06)
        0040ad2d eb26             jmp     MemCopySSE2+0xad55 (0040ad55)
        0040ad2f f30f7e5608       movq    xmm2,qword ptr [esi+0x8]
        0040ad34 0f165610         movhps  xmm2,qword ptr [esi+0x10]
FAULT ->0040ad38 f20ff0           repne   ???

And same result for my second download.

jj2007 · May 31, 2013, 03:11:25 AM

Here is the culprit...

0040AD2F  F30F7E56 08   movq xmm2, [esi+8]
0040AD34  0F1656 10     movhps xmm2, [esi+10]
0040AD38  F20FF00E      lddqu xmm1, [esi]
0040AD3C  660FD3CC      psrlq xmm1, xmm4
0040AD40  660FF3D3      psllq xmm2, xmm3
0040AD44  660FEBCA      por xmm1, xmm2
0040AD48  660F7F0F      movdqa [edi], xmm1
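
The psrlq / psllq / por sequence in this listing looks like the common trick of building a misaligned value from two neighbouring pieces by shifting both and OR-ing them together. The sketch below shows that idea on a single 64-bit lane in plain C. It is an illustration under assumptions, not jj2007's code: the function name is hypothetical, the shift counts held in xmm3 and xmm4 are not visible in the listing, and the arithmetic only holds on a little-endian machine.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Returns the 8 bytes that start k bytes after base[0], using only the
 * two whole words base[0] and base[1]. Little-endian only, k must be 1..7
 * (k = 0 would need a shift by 64, which is undefined in C).
 */
uint64_t load_shifted_u64(const uint64_t *base, unsigned k)
{
    assert(base != 0);
    assert(k >= 1 && k <= 7);

    uint64_t lo = base[0];              /* holds the first 8 - k wanted bytes */
    uint64_t hi = base[1];              /* holds the last k wanted bytes */

    return (lo >> (8 * k))              /* drop the k unwanted low bytes */
         | (hi << (64 - 8 * k));        /* move the next k bytes on top */
}
```

In the listing the same pattern appears to run on two 64-bit lanes at once, with the final movdqa writing the merged 16 bytes to an aligned destination.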

guga · May 31, 2013, 03:52:39 AM

Thanks for the tests, guys.

I made an update; I hope it is a bit faster now.

; Version 2 of the fast memory copy. Can also be used as a fast string copy if you know the string length.
Proc memcpy_SSE_V2:
    Arguments @pDest, @pSource, @Length
    Uses esi, edi, ecx, edx, eax

    mov edi D@pDest
    mov esi D@pSource
    ; we copy the memory 128 bits (16 bytes) at a time
    mov ecx D@Length
    mov eax ecx | shr ecx 4 ; block count: Length / 16 (one block = 4 dwords)
    jz L0> ; The memory size is smaller than 16 bytes. Jump over

        ; Now compute the remainder, i.e. how many bytes the final byte loop must copy
        mov edx ecx | shl edx 4 | sub eax edx ; remainder. It can only be 0 to 15 bytes
        mov edx 0 ; from here on edx is used as an index
        L1:
            movupd XMM1 X$esi+edx*8 ; load the current 4 dwords (16 bytes) from the source into XMM1
            movupd X$edi+edx*8 XMM1 ; store those 16 bytes from XMM1 to the destination
            dec ecx   ; ecx is our counter, Length/16. Why 16? Because each iteration moves 4 dwords, so the loop needs 16 times fewer iterations than a byte-by-byte copy.
            lea edx D$edx+2 ; each block is 128 bits, so the index advances by 2 rather than 1: one index unit is only 8 bytes (the largest scale factor in an address is 8, as in edx*8). lea leaves the flags untouched, so the jnz below still tests the result of dec ecx
                  ; So, when edx = 0, edx*8 = 0 and X$esi+edx*8 points to esi+0 bytes
                  ; when edx = 2, edx*8 = 16 and X$esi+edx*8 points to esi+16 bytes
                  ; when edx = 4, edx*8 = 32 and X$esi+edx*8 points to esi+32 bytes
                  ; What matters is that after each iteration the source and destination addresses are 16 bytes further on.
            jnz L1<
        emms ; reset the FPU tag word so the FPU can be used again (only required after MMX code; XMM registers do not alias the FPU stack)
        test eax eax | jz L4> ; No remainders ? Exit
        jmp L9> ; jmp to the remainder computation

L0:
   ; If we are here, the data is smaller than 16 bytes and only the remainder has to be copied.
   mov edx ecx | shl edx 4 | sub eax edx ; remainder. It can only be 0 to 15 bytes

L2:

    ; If the length is not a multiple of 16 bytes (4 dwords), some bytes remain. Copy them one by one.
    test eax eax | jz L4>  ; No remainders ? Exit
L9:
        lea edi D$edi+edx*8 ; advance edi by edx*8, the number of bytes already copied
        mov eax eax ; intended to avoid a potential stall (no functional effect)
        lea esi D$esi+edx*8 ; advance esi by edx*8, the number of bytes already copied

L3:  movsb | dec eax | jnz L3<

L4:
   
EndP

Is this version a bit faster? Or did rearranging the lea instructions to avoid stalls not help?

jj2007 · May 31, 2013, 04:10:29 AM

Hi Guga,

Could you please mark the lines that you changed? I'm using Masm, and it was quite a bit of work to adapt the syntax ;-)

guga · May 31, 2013, 06:17:15 AM

Ok, many thanks...

OLD

Quote
Proc memcpy_SSE:
Arguments @pDest, @pSource, @Length
Uses esi, edi, ecx, edx, eax

mov edi D@pDest
mov esi D@pSource
; we copy the memory 128 bits (16 bytes) at a time
mov ecx D@Length
mov eax ecx | shr ecx 4 ; block count: Length / 16 (one block = 4 dwords)
jz L0> ; The memory size is smaller than 16 bytes. Jump over

; Now compute the remainder, i.e. how many bytes the final byte loop must copy
mov edx ecx | shl edx 4 | sub eax edx ; remainder. Can be 0 to 15
mov edx 0 ; from here on edx is used as an index
L1:
movupd XMM1 X$esi+edx*8 ; load the current 4 dwords (16 bytes) from the source into XMM1
movupd X$edi+edx*8 XMM1 ; store those 16 bytes from XMM1 to the destination
lea edx D$edx+2
dec ecx
jnz L1<
emms ; clear the registers so the FPU can be used again
shl edx 3 ; multiply edx by 8 to get the byte offset
add edi edx
add esi edx
jmp L2> ; jump to the remainder copy

L0:
; If we are here, the data is smaller than 16 bytes and only the remainder has to be copied.
mov edx ecx | shl edx 4 | sub eax edx ; remainder. Can be 0 to 15

L2:

; If the length is not a multiple of 16 bytes (4 dwords), some bytes remain. Copy them one by one.
While eax <> 0
movsb
dec eax
End_While

EndP

NEW

Quote
Proc memcpy_SSE_V2:
Arguments @pDest, @pSource, @Length
Uses esi, edi, ecx, edx, eax

mov edi D@pDest
mov esi D@pSource
; we copy the memory 128 bits (16 bytes) at a time
mov ecx D@Length
mov eax ecx | shr ecx 4 ; block count: Length / 16 (one block = 4 dwords)
jz L0> ; The memory size is smaller than 16 bytes. Jump over

; Now compute the remainder, i.e. how many bytes the final byte loop must copy
mov edx ecx | shl edx 4 | sub eax edx ; remainder. It can only be 0 to 15 bytes
mov edx 0 ; from here on edx is used as an index
L1:
movupd XMM1 X$esi+edx*8 ; load the current 4 dwords (16 bytes) from the source into XMM1
movupd X$edi+edx*8 XMM1 ; store those 16 bytes from XMM1 to the destination
dec ecx
lea edx D$edx+2
jnz L1<
emms ; clear the registers so the FPU can be used again
test eax eax | jz L4> ; No remainders ? Exit
jmp L9> ; jump to the remainder copy

L0:
; If we are here, the data is smaller than 16 bytes and only the remainder has to be copied.
mov edx ecx | shl edx 4 | sub eax edx ; remainder. It can only be 0 to 15 bytes

L2:

; If the length is not a multiple of 16 bytes (4 dwords), some bytes remain. Copy them one by one.
test eax eax | jz L4> ; No remainders ? Exit
L9:
lea edi D$edi+edx*8 ; advance edi by edx*8, the number of bytes already copied
mov eax eax ; intended to avoid a potential stall (no functional effect)
lea esi D$esi+edx*8 ; advance esi by edx*8, the number of bytes already copied

L3: movsb | dec eax | jnz L3<

L4:

EndP

I tried to assemble your test with MASM, but there was an error on the xmm registers, a CPU mode problem. I am not sure whether I have the latest MASM version (it is from 2.011/2.012; is that the latest one?)

jj2007 · May 31, 2013, 07:57:20 AM

Quote from: guga on May 31, 2013, 06:17:15 AM
Ok, many thanks...
...
I tried to assemble your test with MASM, but there was an error on the xmm registers, a CPU mode problem. I am not sure whether I have the latest MASM version (it is from 2.011/2.012; is that the latest one?)

Here it is, results below; I hope everything is correctly translated.
Re Masm: Use JWasm instead - fully compatible, fewer problems, much faster.

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)

Algo           memcpy   MemCo1   MemCo2  MemCoC3  MemCoP4  MemCoC2   MemCoL  xmemcpy memcpy_S
Description       CRT rep movs   movdqa  lps+hps   movdqa   movdqa   Masm32   Habran     Guga
                       dest-al    psllq CeleronM  dest-al   src-al  library
Code size           ?       70      291      222      200      269       33      104       88
---------------------------------------------------------------------------------------------
2048, d0s0-0      556      566      363      363      373      363      563     1051      944
2048, d1s1-0     1047      619      421      423      444      423     1684     1782     1705
2048, d7s7-0      567      619      418      420      446      420     1733     1782     1705
2048, d7s8-1     1677     1714     1090      441     1118     1118     1302     1337     1271
2048, d7s9-2     1677     1714     1090      441     1118     1118     1716     1782     1715
2048, d8s7+1     1655     1503     1090      888      980      975     1648     1245     1133
2048, d8s8-0      556      619      420      422      448      422      563     1051      944
2048, d8s9-1     1664     1714     1083      441     1118     1118     1661     1241     1137
2048, d9s7+2     1668     1502     1081      889      980      975     1763     1496     1425
2048, d9s8+1     1668     1502     1081      889      980      975     1283     1052      945
2048, d9s9-0     1047      619      420      422      448      422     1687     1498     1446
2048, d15s15      567      619      423      424      446      424     1679     1498     1445

Gunther · May 31, 2013, 08:01:05 AM

Here the new results:

Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
 
Algo           memcpy   MemCo1   MemCo2  MemCoC3  MemCoP4  MemCoC2   MemCoL  xmemcpy memcpy_S 
Description       CRT rep movs   movdqa  lps+hps   movdqa   movdqa   Masm32   Habran     Guga 
                       dest-al    psllq CeleronM  dest-al   src-al  library                   
Code size           ?       70      291      222      200      269       33      104       88 
--------------------------------------------------------------------------------------------- 
2048, d0s0-0      444      224      252      246      248      249      222      293      281 
2048, d1s1-0      269      246      276      272      274      274      278      304      295 
2048, d7s7-0      271      250      276      268      275      272      280      302      301 
2048, d7s8-1      283      283      614      453      262      272      283      307      296 
2048, d7s9-2      289      284      617      453      262      269      283      307      301 
2048, d8s7+1      281      290      622      484      263      272      283      307      301 
2048, d8s8-0      271      256      296      284      285      292      282      306      300 
2048, d8s9-1      275      289      611      453      261      275      282      306      301 
2048, d9s7+2      287      291      609      484      262      276      282      306      301 
2048, d9s8+1      281      290      605      484      264      275      273      306      301 
2048, d9s9-0      274      256      280      277      278      282      279      308      306 
2048, d15s15      273      260      286      274      277      282      278      303      297

--- ok ---

Gunther

MichaelW · May 31, 2013, 08:23:09 AM

I still get the same exception.

And now that I check "LDDQU" in the Intel manual:

Quote
5.7 SSE3 INSTRUCTIONS
...
5.7.2 SSE3 Specialized 128-bit Unaligned Data Load Instruction
LDDQU Specialized 128-bit unaligned load designed to avoid cache line splits.
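
That fits the crash: LDDQU is an SSE3 instruction, and the P4 report above shows SSE2 only. A test program can check for this at run time before it picks a code path. The sketch below assumes GCC or Clang on x86 and uses their `<cpuid.h>` helper `__get_cpuid`. The function names and the variant strings are made up for illustration and are not taken from the test program.

```c
#include <assert.h>
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#include <cpuid.h>      /* GCC/Clang helper: __get_cpuid */
#define HAVE_X86_CPUID 1
#else
#define HAVE_X86_CPUID 0
#endif

/* Returns 1 if CPUID leaf 1 reports SSE2 (EDX bit 26), else 0. */
int cpu_has_sse2(void)
{
#if HAVE_X86_CPUID
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    return (int)((edx >> 26) & 1u);
#else
    return 0;           /* assumption: treat non-x86 builds as "no SSE" */
#endif
}

/* Returns 1 if CPUID leaf 1 reports SSE3 (ECX bit 0), else 0. */
int cpu_has_sse3(void)
{
#if HAVE_X86_CPUID
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    return (int)(ecx & 1u);
#else
    return 0;
#endif
}

/* Hypothetical dispatcher: names the unaligned load a copy loop could use. */
const char *pick_unaligned_load(void)
{
    if (cpu_has_sse3())
        return "lddqu";     /* SSE3 only */
    if (cpu_has_sse2())
        return "movdqu";    /* SSE2 unaligned integer load */
    return "rep movs";      /* no SSE2: fall back to string moves */
}
```

On a Northwood-class P4 such a check would report SSE2 without SSE3, so the lddqu path would be skipped instead of raising an exception.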

dedndave · May 31, 2013, 08:34:15 AM

p4 prescott w/htt, xp mce2005, sp3

Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)

Algo           memcpy   MemCo1   MemCo2  MemCoC3  MemCoP4  MemCoC2   MemCoL  xmemcpy memcpy_S
Description       CRT rep movs   movdqa  lps+hps   movdqa   movdqa   Masm32   Habran     Guga
                       dest-al    psllq CeleronM  dest-al   src-al  library

Code size           ?       70      291      222      200      269       33      104       88
---------------------------------------------------------------------------------------------
2048, d0s0-0      737      729      613      616      616      611      726     1593     1610
2048, d1s1-0     1102      833      641      641      648      647     4396     3939     3938
2048, d7s7-0     1005      850      648      647      650      655     4416     3935     4079
2048, d7s8-1     1367     1451     1208      870      615      619     4320     3817     3796
2048, d7s9-2     1369     1451     1217      869      618      619     4471     3920     3928
2048, d8s7+1     1360     1440     1186     1311      622     1043     1355     1781     1783
2048, d8s8-0      979      854      655      653      657      646      981     1592     1595
2048, d8s9-1     1354     1472     1204      870      616      619     1350     1764     1762
2048, d9s7+2     1682     1447     1181     1311      615     1037     4169     4122     4158
2048, d9s8+1     1682     1443     1175     1312      618     1033     4060     4030     4052
2048, d9s9-0     1103      835      665      664      664      665     4153     4116     4138
2048, d15s15      765      842      656      655      652      655     4169     4122     4142

jj2007 · May 31, 2013, 08:36:23 AM

Quote from: MichaelW on May 31, 2013, 08:23:09 AM
I still get the same exception.

Here is one without lddqu.

MichaelW · May 31, 2013, 08:50:58 AM

Quote from: jj2007 on May 31, 2013, 08:36:23 AM
Here is one without lddqu.

Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE2)

Algo           memcpy   MemCo1   MemCoL  xmemcpy memcpy_S
Description       CRT rep movs   Masm32   Habran     Guga
                       dest-al  library
Code size           ?       70       33      104       88
---------------------------------------------------------
2048, d0s0-0      628      642      642     1321     1331
2048, d1s1-0     1087      715     4164     3140     3187
2048, d7s7-0      867      713     3264     3325     7161
2048, d7s8-1     3359     3530     5099     4081     4124
2048, d7s9-2     3357     3528     7352     7144     7257
2048, d8s7+1     3180     3382     3188     4296     4293
2048, d8s8-0     1978     1412     1972     2137     2164
2048, d8s9-1     3310     3499     3103     4276     4257
2048, d9s7+2     4970     3384     5744     5617     5620
2048, d9s8+1     4952     3377     5117     3920     3935
2048, d9s9-0     2932     1368     6637     6043     6071
2048, d15s15     1295     1454     6638     6023     6079


--- ok ---

The MASM Forum
