The MASM Forum
Projects => Rarely Used Projects => RosAsm => Topic started by: guga on May 30, 2013, 03:33:54 PM
-
This algorithm is suitable for copying large blocks of memory (aligned or not). It can also be used to copy large strings when you already know their length.
; Fast memory copy. Can also be used as a fast string copy if you know the string length.
; pDest = destination buffer for the copied memory
; pSource = the input buffer
; Length = length of the input buffer (pSource), in bytes.
; It has the same functionality as memcpy from msvcrt.dll, but it is faster.
; It works by copying 128 bits at a time (4 dwords).
Proc memcpy_SSE:
Arguments @pDest, @pSource, @Length
Uses esi, edi, ecx, edx, eax
mov edi D@pDest
mov esi D@pSource
; we are copying the memory 128 bits (16 bytes) at a time
mov ecx D@Length
mov eax ecx | shr ecx 4 ; block count. Divide by 16 (4 dwords)
jz L0> ; The memory size is smaller than 16 bytes. Jump over the loop
; Now we must compute the remainder, to see how many bytes are left for the byte loop
mov edx ecx | shl edx 4 | sub eax edx ; remainder. Can be 0 to 15
mov edx 0 ; here it is used as an index
L1:
movupd XMM1 X$esi+edx*8 ; copy 4 dwords from the source into register XMM1
movupd X$edi+edx*8 XMM1 ; copy the 4 dwords from register XMM1 to the destination
lea edx D$edx+2 ; we are copying 128 bits, so instead of incrementing by 1 we add 2, because each index unit covers only 8 bytes (the largest scale factor allowed in an address is 8: edx*8)
; So, when edx = 0, edx*8 = 0. X$esi+edx*8 points to esi+0 bytes
; when edx = 2, edx*8 = 16. X$esi+edx*8 points to esi+16 bytes
; when edx = 4, edx*8 = 32. X$esi+edx*8 points to esi+32 bytes.
; The important point is that after each iteration the source and destination addresses are 16 bytes further on.
dec ecx ; ecx is our counter, holding length/16. Why 16? Because we advance 4 dwords at a time, so the loop runs 16 times fewer iterations than a byte-by-byte copy.
jnz L1<
emms ; clear the registers so the FPU can be used again
shl edx 3 ; multiply edx by 8 to get the byte offset
add edi edx
add esi edx
jmp L2> ; jump to the remainder copy
L0:
; If we are here, the data is smaller than 16 bytes and we only need to compute the remainder.
mov edx ecx | shl edx 4 | sub eax edx ; remainder. Can be 0 to 15
L2:
; If the length is not a multiple of 16 bytes (4 dwords), some remainder bytes are left. Copy them one by one.
While eax <> 0
movsb
dec eax
End_While
EndP
Example of usage:
[OutputBuffer: B$ 0 #2048]
mov esi {B$ "Hello, my name is g works as expected, since i´ tryoing to give a update of here. Hello, my name is guga, i´m 41 years old. Brazilian. I am testing this 128 bit operation to see if it works ok ? I hope works as expected, since i´ tryoing to give a update of here. Hello, my name is guga, i´m 41 years old. Brazilian. I am testing this 128 bit operation to see if it works ok ? I hope works as expected, since i´ tryoing to ", 0};D@pSource
c_call 'msvcrt.strlen' esi
call memcpy_SSE OutputBuffer, esi, eax
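For readers who do not use RosAsm, the structure of the routine above can be sketched in portable C. This is only an illustration under assumptions: `copy_blocks16` is a hypothetical name, and a fixed 16-byte `memcpy` stands in for the `movupd` load/store pair (a compiler may turn it into one unaligned 128-bit move on x86, but that is not guaranteed).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical C sketch of memcpy_SSE: copy 16-byte blocks, then the tail.
   The buffers must not overlap. */
void *copy_blocks16(void *dest, const void *src, size_t len)
{
    unsigned char *d = dest;
    const unsigned char *s = src;
    size_t blocks = len >> 4;      /* shr ecx 4: number of 16-byte blocks */
    size_t rest   = len & 15;      /* remainder, 0 to 15 bytes            */

    assert(dest != NULL && src != NULL);

    while (blocks--) {
        memcpy(d, s, 16);          /* stands in for the movupd load/store */
        d += 16;
        s += 16;
    }
    while (rest--)                 /* movsb loop for the leftover bytes   */
        *d++ = *s++;
    return dest;
}
```

As in the RosAsm example, the caller passes the length explicitly, for instance `strlen(text) + 1` to include the terminating zero.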
-
Looks good - for some alignments better than the CRT
Here are some timings.
AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
Algo         memcpy  MemCo1    MemCo2  MemCoC3   MemCoP4  MemCoC2  MemCoL   xmemcpy  memcpy_S
Description  CRT     rep movs  movdqa  lps+hps   movdqa   movdqa   Masm32   Habran   Guga
                     dest-al   psllq   CeleronM  dest-al  src-al   library
Code size    ?       70        291     222       200      269      33       104      84
---------------------------------------------------------------------------------------------
2048, d0s0-0 552 555 359 424 424 359 547 541 558
2048, d1s1-0 734 597 410 473 473 410 1060 798 815
2048, d7s7-0 737 598 422 474 474 411 1059 798 815
2048, d7s8-1 809 867 1016 563 566 567 802 543 560
2048, d7s9-2 809 853 1016 563 566 567 1058 798 801
2048, d8s7+1 819 847 871 563 564 565 804 606 631
2048, d8s8-0 720 602 404 465 465 402 547 559 544
2048, d8s9-1 802 847 1009 563 565 568 806 607 631
2048, d9s7+2 808 855 861 565 564 565 1060 798 801
2048, d9s8+1 823 852 862 563 576 565 803 543 546
2048, d9s9-0 721 594 411 472 486 409 1060 798 801
2048, d15s15 727 591 425 470 471 408 1059 798 802
Legend:
2048 bytes copied
d7s7-0 dest7, src7: both are 7 bytes above a 16-byte aligned dest/src
d7s9-2 dest7, src9: dest is 7 bytes above, src 9 bytes above a 16-byte alignment; diff src-dest = 2 bytes misalignment
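The row labels can be reproduced with a little pointer arithmetic. The sketch below is an assumption about how such test pointers might be built, not the benchmark's actual code; `offset16`, `at_offset` and `relative_misalignment` are hypothetical names. Judging by the rows (d7s9 gives -2, d8s7 gives +1), the signed suffix appears to be dest minus src.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes above the previous 16-byte boundary: the "7" in d7 or the "9" in s9. */
unsigned offset16(const void *p)
{
    return (unsigned)((uintptr_t)p & 15u);
}

/* Hypothetical helper: pointer k bytes above a 16-byte aligned base. */
unsigned char *at_offset(unsigned char *aligned_base, unsigned k)
{
    assert(offset16(aligned_base) == 0 && k < 16);
    return aligned_base + k;
}

/* Signed dest-minus-src difference as it appears in the row labels:
   -2 for d7s9, +1 for d8s7, 0 for d7s7. */
int relative_misalignment(const void *dest, const void *src)
{
    return (int)offset16(dest) - (int)offset16(src);
}
```

A difference of 0 means source and destination can both be brought to a 16-byte boundary at the same time; any other value means at least one side of every 16-byte move is unaligned.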
-
Jochen,
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
Algo         memcpy  MemCo1    MemCo2  MemCoC3   MemCoP4  MemCoC2  MemCoL   xmemcpy  memcpy_S
Description  CRT     rep movs  movdqa  lps+hps   movdqa   movdqa   Masm32   Habran   Guga
                     dest-al   psllq   CeleronM  dest-al  src-al   library
Code size    ?       70        291     222       200      269      33       104      84
---------------------------------------------------------------------------------------------
2048, d0s0-0 361 435 246 248 247 249 223 291 282
2048, d1s1-0 274 251 275 272 276 272 281 305 291
2048, d7s7-0 276 254 278 274 279 275 281 306 290
2048, d7s8-1 287 286 617 452 260 274 281 306 291
2048, d7s9-2 286 287 617 452 261 274 281 306 291
2048, d8s7+1 280 290 622 481 262 277 283 306 291
2048, d8s8-0 275 255 295 284 285 292 282 306 290
2048, d8s9-1 280 278 612 455 267 280 287 311 292
2048, d9s7+2 290 294 613 487 267 280 287 311 296
2048, d9s8+1 291 296 612 487 268 280 287 311 296
2048, d9s9-0 278 261 283 281 280 286 287 311 296
2048, d15s15 277 218 285 281 284 288 287 310 295
--- ok ---
thank you for your distribution.
Gunther
-
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
Algo         memcpy  MemCo1    MemCo2  MemCoC3   MemCoP4  MemCoC2  MemCoL   xmemcpy  memcpy_S
Description  CRT     rep movs  movdqa  lps+hps   movdqa   movdqa   Masm32   Habran   Guga
                     dest-al   psllq   CeleronM  dest-al  src-al   library
Code size    ?       70        291     222       200      269      33       104      84
---------------------------------------------------------------------------------------------
2048, d0s0-0 737 730 618 616 616 615 725 1589 1603
2048, d1s1-0 1103 833 641 642 648 646 4394 3935 3923
2048, d7s7-0 1007 847 649 647 651 645 4400 3927 3921
2048, d7s8-1 1372 1452 1220 870 617 619 4322 3800 3788
2048, d7s9-2 1371 1450 1212 870 619 621 4437 3925 3929
2048, d8s7+1 1360 1443 1193 1311 618 1042 1360 1774 1780
2048, d8s8-0 980 855 655 655 659 647 981 1592 1596
2048, d8s9-1 1353 1468 1208 870 615 618 1356 1761 1768
2048, d9s7+2 1681 1445 1207 1320 617 1041 4158 4105 4137
2048, d9s8+1 1681 1449 1182 1311 620 1033 4025 4012 4029
2048, d9s9-0 1104 837 666 665 666 664 4154 4100 4125
2048, d15s15 767 843 657 657 658 653 4143 4117 4135
-
Running on my P4 Northwood I get an exception:
c000001eh = STATUS_INVALID_LOCK_SEQUENCE
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE2)
Algo         memcpy  MemCo1    MemCo2  MemCoC3   MemCoP4  MemCoC2  MemCoL   xmemcpy  memcpy_S
Description  CRT     rep movs  movdqa  lps+hps   movdqa   movdqa   Masm32   Habran   Guga
                     dest-al   psllq   CeleronM  dest-al  src-al   library
Code size    ?       70        291     222       200      269      33       104      84
---------------------------------------------------------------------------------------------
2048, d0s0-0 616 629 493 494 492 489 632 1322 1333
2048, d1s1-0 1092 734 537 541 547 543 3395 3225 3301
2048, d7s7-0 881 734 551 553 567 554 3408 3224 3312
2048, d7s8-1 1964 1662
Microsoft (R) DrWtsn32
Copyright (C) 1985-2001 Microsoft Corp. All rights reserved.
Application exception occurred:
App: \\P3\MemCopySSE2\MemCopySSE2.exe (pid=3728)
When: 5/30/2013 @ 11:33:52.343
Exception number: c000001e
()
*----> System Information <----*
Computer Name: DELL
User Name: User
Terminal Session Id: 0
Number of Processors: 2
Processor Type: x86 Family 15 Model 2 Stepping 9
Windows Version: 5.1
Current Build: 2600
Service Pack: 3
Current Type: Multiprocessor Free
Registered Organization:
Registered Owner: User
*----> Task List <----*
0 System Process
4 System
564 smss.exe
620 csrss.exe
644 winlogon.exe
688 services.exe
700 lsass.exe
876 svchost.exe
944 svchost.exe
1040 MsMpEng.exe
1076 svchost.exe
1180 svchost.exe
1252 svchost.exe
1476 spoolsv.exe
1696 svchost.exe
1744 cisvc.exe
1784 nvsvc32.exe
544 alg.exe
1716 Explorer.EXE
500 smax4pnp.exe
852 BCMSMMSG.exe
240 point32.exe
1144 msseces.exe
1504 ctfmon.exe
3472 cidaemon.exe
3728 MemCopySSE2.exe
3976 drwtsn32.exe
*----> Module List <----*
(0000000000400000 - 000000000040f000: \\P3\MemCopySSE2\MemCopySSE2.exe
(0000000076390000 - 00000000763ad000: C:\WINDOWS\system32\IMM32.DLL
(0000000077b40000 - 0000000077b62000: C:\WINDOWS\system32\Apphelp.dll
(0000000077c00000 - 0000000077c08000: C:\WINDOWS\system32\VERSION.dll
(0000000077c10000 - 0000000077c68000: C:\WINDOWS\system32\msvcrt.dll
(0000000077dd0000 - 0000000077e6b000: C:\WINDOWS\system32\ADVAPI32.dll
(0000000077e70000 - 0000000077f03000: C:\WINDOWS\system32\RPCRT4.dll
(0000000077f10000 - 0000000077f59000: C:\WINDOWS\system32\GDI32.dll
(0000000077f60000 - 0000000077fd6000: C:\WINDOWS\system32\SHLWAPI.dll
(0000000077fe0000 - 0000000077ff1000: C:\WINDOWS\system32\Secur32.dll
(000000007c800000 - 000000007c8f6000: C:\WINDOWS\system32\kernel32.dll
(000000007c900000 - 000000007c9b2000: C:\WINDOWS\system32\ntdll.dll
(000000007e410000 - 000000007e4a1000: C:\WINDOWS\system32\user32.dll
*----> State Dump for Thread Id 0xa44 <----*
eax=00000001 ebx=756e6547 ecx=00000080 edx=00000038 esi=0040dca0 edi=0040e4e0
eip=0040ad38 esp=0012ff98 ebp=0012fff0 iopl=0 nv up ei ng nz ac po cy
cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00000297
*** WARNING: Unable to verify checksum for \\P3\MemCopySSE2\MemCopySSE2.exe
*** ERROR: Module load completed but symbols could not be loaded for \\P3\MemCopySSE2\MemCopySSE2.exe
function: MemCopySSE2
0040ad1b cb retf
0040ad1c 660febca por xmm1,dx
0040ad20 660f7f0f movdqa oword ptr [edi],xmm1
0040ad24 8d7610 lea esi,[esi+0x10]
0040ad27 49 dec ecx
0040ad28 8d7f10 lea edi,[edi+0x10]
0040ad2b 75d9 jnz MemCopySSE2+0xad06 (0040ad06)
0040ad2d eb26 jmp MemCopySSE2+0xad55 (0040ad55)
0040ad2f f30f7e5608 movq xmm2,qword ptr [esi+0x8]
0040ad34 0f165610 movhps xmm2,qword ptr [esi+0x10]
FAULT ->0040ad38 f20ff0 repne ???
And same result for my second download.
-
Here is the culprit...
0040AD2F  F30F7E56 08  movq xmm2, [esi+8]
0040AD34  0F1656 10    movhps xmm2, [esi+10]
0040AD38  F20FF00E     lddqu xmm1, [esi]
0040AD3C  660FD3CC     psrlq xmm1, xmm4
0040AD40  660FF3D3     psllq xmm2, xmm3
0040AD44  660FEBCA     por xmm1, xmm2
0040AD48  660F7F0F     movdqa [edi], xmm1
-
Thanks for the tests, guys.
I made an update; I hope it is a bit faster now.
; Version 2 of the fast memory copy. Can also be used as a fast string copy if you know the string length.
Proc memcpy_SSE_V2:
Arguments @pDest, @pSource, @Length
Uses esi, edi, ecx, edx, eax
mov edi D@pDest
mov esi D@pSource
; we are copying the memory 128 bits (16 bytes) at a time
mov ecx D@Length
mov eax ecx | shr ecx 4 ; block count. Divide by 16 (4 dwords)
jz L0> ; The memory size is smaller than 16 bytes. Jump over the loop
; Now we must compute the remainder, to see how many bytes are left for the byte loop
mov edx ecx | shl edx 4 | sub eax edx ; remainder. It can only be 0 to 15 bytes
mov edx 0 ; here it is used as an index
L1:
movupd XMM1 X$esi+edx*8 ; copy 4 dwords from the source into register XMM1
movupd X$edi+edx*8 XMM1 ; copy the 4 dwords from register XMM1 to the destination
dec ecx ; ecx is our counter, holding length/16. Why 16? Because we advance 4 dwords at a time, so the loop runs 16 times fewer iterations than a byte-by-byte copy.
lea edx D$edx+2 ; we are copying 128 bits, so instead of incrementing by 1 we add 2, because each index unit covers only 8 bytes (the largest scale factor allowed in an address is 8: edx*8). lea does not change the flags, so jnz below still tests the result of dec.
; So, when edx = 0, edx*8 = 0. X$esi+edx*8 points to esi+0 bytes
; when edx = 2, edx*8 = 16. X$esi+edx*8 points to esi+16 bytes
; when edx = 4, edx*8 = 32. X$esi+edx*8 points to esi+32 bytes.
; The important point is that after each iteration the source and destination addresses are 16 bytes further on.
jnz L1<
emms ; clear the registers so the FPU can be used again
test eax eax | jz L4> ; No remainder? Exit
jmp L9> ; jump to the remainder copy
L0:
; If we are here, the data is smaller than 16 bytes and we only need to compute the remainder.
mov edx ecx | shl edx 4 | sub eax edx ; remainder. It can only be 0 to 15 bytes
L2:
; If the length is not a multiple of 16 bytes (4 dwords), some remainder bytes are left. Copy them one by one.
test eax eax | jz L4> ; No remainder? Exit
L9:
lea edi D$edi+edx*8 ; multiply edx by 8 to get the byte offset
mov eax eax ; filler intended to avoid potential stalls
lea esi D$esi+edx*8 ; multiply edx by 8 to get the byte offset
L3: movsb | dec eax | jnz L3<
L4:
EndP
Is this version a bit faster? Or did rearranging the lea instructions to avoid stalls not help?
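The bookkeeping shared by both versions (block count in ecx, remainder in eax, index in edx stepped by 2 and scaled by 8) can be checked with a small C sketch. The struct and function names here are hypothetical and only mirror the register roles described in the comments.

```c
#include <assert.h>
#include <stddef.h>

struct copy_plan {
    size_t blocks;       /* ecx after shr ecx 4            */
    size_t remainder;    /* eax after sub eax edx, 0 to 15 */
    size_t tail_offset;  /* edx*8 once the loop has ended  */
};

struct copy_plan plan_copy(size_t length)
{
    struct copy_plan p;
    size_t index = 0;                /* edx, stepped by 2 per block */
    size_t i;

    p.blocks = length >> 4;
    p.remainder = length - (p.blocks << 4);
    for (i = 0; i < p.blocks; i++)
        index += 2;                  /* lea edx D$edx+2 */
    p.tail_offset = index * 8;       /* lea edi D$edi+edx*8 */

    /* The remainder is just the low four bits of the length, and the
       tail starts exactly where the 16-byte blocks end. */
    assert(p.remainder == (length & 15));
    assert(p.tail_offset + p.remainder == length);
    return p;
}
```

For the 2048-byte test buffers this gives 128 blocks and no remainder; the first assert also suggests that a single `and eax 15` would yield the same remainder as the `shl`/`sub` pair.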
-
Hi Guga,
Could you please mark the lines that you changed? I'm using Masm, and it was quite a bit of work to adapt the syntax ;-)
-
Ok, many thanks...
OLD
Proc memcpy_SSE:
Arguments @pDest, @pSource, @Length
Uses esi, edi, ecx, edx, eax
mov edi D@pDest
mov esi D@pSource
; we are copying the memory 128 bits (16 bytes) at a time
mov ecx D@Length
mov eax ecx | shr ecx 4 ; block count. Divide by 16 (4 dwords)
jz L0> ; The memory size is smaller than 16 bytes. Jump over the loop
; Now we must compute the remainder, to see how many bytes are left for the byte loop
mov edx ecx | shl edx 4 | sub eax edx ; remainder. Can be 0 to 15
mov edx 0 ; here it is used as an index
L1:
movupd XMM1 X$esi+edx*8 ; copy 4 dwords from the source into register XMM1
movupd X$edi+edx*8 XMM1 ; copy the 4 dwords from register XMM1 to the destination
lea edx D$edx+2
dec ecx
jnz L1<
emms ; clear the registers so the FPU can be used again
shl edx 3 ; multiply edx by 8 to get the byte offset
add edi edx
add esi edx
jmp L2> ; jump to the remainder copy
L0:
; If we are here, the data is smaller than 16 bytes and we only need to compute the remainder.
mov edx ecx | shl edx 4 | sub eax edx ; remainder. Can be 0 to 15
L2:
; If the length is not a multiple of 16 bytes (4 dwords), some remainder bytes are left. Copy them one by one.
While eax <> 0
movsb
dec eax
End_While
EndP
NEW
Proc memcpy_SSE_V2:
Arguments @pDest, @pSource, @Length
Uses esi, edi, ecx, edx, eax
mov edi D@pDest
mov esi D@pSource
; we are copying the memory 128 bits (16 bytes) at a time
mov ecx D@Length
mov eax ecx | shr ecx 4 ; block count. Divide by 16 (4 dwords)
jz L0> ; The memory size is smaller than 16 bytes. Jump over the loop
; Now we must compute the remainder, to see how many bytes are left for the byte loop
mov edx ecx | shl edx 4 | sub eax edx ; remainder. It can only be 0 to 15 bytes
mov edx 0 ; here it is used as an index
L1:
movupd XMM1 X$esi+edx*8 ; copy 4 dwords from the source into register XMM1
movupd X$edi+edx*8 XMM1 ; copy the 4 dwords from register XMM1 to the destination
dec ecx
lea edx D$edx+2
jnz L1<
emms ; clear the registers so the FPU can be used again
test eax eax | jz L4> ; No remainder? Exit
jmp L9> ; jump to the remainder copy
L0:
; If we are here, the data is smaller than 16 bytes and we only need to compute the remainder.
mov edx ecx | shl edx 4 | sub eax edx ; remainder. It can only be 0 to 15 bytes
L2:
; If the length is not a multiple of 16 bytes (4 dwords), some remainder bytes are left. Copy them one by one.
test eax eax | jz L4> ; No remainder? Exit
L9:
lea edi D$edi+edx*8 ; multiply edx by 8 to get the byte offset
mov eax eax ; filler intended to avoid potential stalls
lea esi D$esi+edx*8 ; multiply edx by 8 to get the byte offset
L3: movsb | dec eax | jnz L3<
L4:
EndP
I tried to assemble your test with Masm, but there was an error on the xmm registers: a CPU mode problem. I am not sure if I have the latest Masm version (mine is from 2.011/2.012; is that the latest one?)
-
Here it is, results below; I hope everything is correctly translated.
Re Masm: Use JWasm (http://www.japheth.de/JWasm.html) instead - fully compatible, fewer problems, much faster.
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
Algo         memcpy  MemCo1    MemCo2  MemCoC3   MemCoP4  MemCoC2  MemCoL   xmemcpy  memcpy_S
Description  CRT     rep movs  movdqa  lps+hps   movdqa   movdqa   Masm32   Habran   Guga
                     dest-al   psllq   CeleronM  dest-al  src-al   library
Code size    ?       70        291     222       200      269      33       104      88
---------------------------------------------------------------------------------------------
2048, d0s0-0 556 566 363 363 373 363 563 1051 944
2048, d1s1-0 1047 619 421 423 444 423 1684 1782 1705
2048, d7s7-0 567 619 418 420 446 420 1733 1782 1705
2048, d7s8-1 1677 1714 1090 441 1118 1118 1302 1337 1271
2048, d7s9-2 1677 1714 1090 441 1118 1118 1716 1782 1715
2048, d8s7+1 1655 1503 1090 888 980 975 1648 1245 1133
2048, d8s8-0 556 619 420 422 448 422 563 1051 944
2048, d8s9-1 1664 1714 1083 441 1118 1118 1661 1241 1137
2048, d9s7+2 1668 1502 1081 889 980 975 1763 1496 1425
2048, d9s8+1 1668 1502 1081 889 980 975 1283 1052 945
2048, d9s9-0 1047 619 420 422 448 422 1687 1498 1446
2048, d15s15 567 619 423 424 446 424 1679 1498 1445
-
Here the new results:
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
Algo         memcpy  MemCo1    MemCo2  MemCoC3   MemCoP4  MemCoC2  MemCoL   xmemcpy  memcpy_S
Description  CRT     rep movs  movdqa  lps+hps   movdqa   movdqa   Masm32   Habran   Guga
                     dest-al   psllq   CeleronM  dest-al  src-al   library
Code size    ?       70        291     222       200      269      33       104      88
---------------------------------------------------------------------------------------------
2048, d0s0-0 444 224 252 246 248 249 222 293 281
2048, d1s1-0 269 246 276 272 274 274 278 304 295
2048, d7s7-0 271 250 276 268 275 272 280 302 301
2048, d7s8-1 283 283 614 453 262 272 283 307 296
2048, d7s9-2 289 284 617 453 262 269 283 307 301
2048, d8s7+1 281 290 622 484 263 272 283 307 301
2048, d8s8-0 271 256 296 284 285 292 282 306 300
2048, d8s9-1 275 289 611 453 261 275 282 306 301
2048, d9s7+2 287 291 609 484 262 276 282 306 301
2048, d9s8+1 281 290 605 484 264 275 273 306 301
2048, d9s9-0 274 256 280 277 278 282 279 308 306
2048, d15s15 273 260 286 274 277 282 278 303 297
--- ok ---
Gunther
-
I still get the same exception.
And now that I check “LDDQU” in the Intel manual:
5.7 SSE3 INSTRUCTIONS
...
5.7.2 SSE3 Specialized 128-bit Unaligned Data Load Instruction
LDDQU Specialized 128-bit unaligned load designed to avoid cache line splits.
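Since LDDQU is listed under SSE3, a CPU that reports only SSE2 cannot execute it, and a test program would have to fall back to `movdqu` on such machines. The C sketch below shows one way to make that decision from the CPUID leaf 1 feature bits (SSE2 is EDX bit 26, SSE3 is ECX bit 0). The function and enum names are hypothetical, and the register values are passed in as parameters; obtaining them (for example with `__get_cpuid` from GCC's `<cpuid.h>` or `__cpuid` from MSVC's `<intrin.h>`) is left out to keep the sketch portable.

```c
#include <assert.h>
#include <stdint.h>

enum load_kind { LOAD_MOVDQU, LOAD_LDDQU };

/* ecx_leaf1 / edx_leaf1: ECX and EDX as returned by CPUID with EAX = 1. */
int has_sse2(uint32_t edx_leaf1) { return (int)((edx_leaf1 >> 26) & 1u); }
int has_sse3(uint32_t ecx_leaf1) { return (int)(ecx_leaf1 & 1u); }

/* Pick the unaligned 128-bit load that the CPU can actually execute. */
enum load_kind pick_unaligned_load(uint32_t ecx_leaf1, uint32_t edx_leaf1)
{
    assert(has_sse2(edx_leaf1));   /* this sketch assumes at least SSE2 */
    return has_sse3(ecx_leaf1) ? LOAD_LDDQU : LOAD_MOVDQU;
}
```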
-
p4 prescott w/htt, xp mce2005, sp3
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
Algo         memcpy  MemCo1    MemCo2  MemCoC3   MemCoP4  MemCoC2  MemCoL   xmemcpy  memcpy_S
Description  CRT     rep movs  movdqa  lps+hps   movdqa   movdqa   Masm32   Habran   Guga
                     dest-al   psllq   CeleronM  dest-al  src-al   library
Code size    ?       70        291     222       200      269      33       104      88
---------------------------------------------------------------------------------------------
2048, d0s0-0 737 729 613 616 616 611 726 1593 1610
2048, d1s1-0 1102 833 641 641 648 647 4396 3939 3938
2048, d7s7-0 1005 850 648 647 650 655 4416 3935 4079
2048, d7s8-1 1367 1451 1208 870 615 619 4320 3817 3796
2048, d7s9-2 1369 1451 1217 869 618 619 4471 3920 3928
2048, d8s7+1 1360 1440 1186 1311 622 1043 1355 1781 1783
2048, d8s8-0 979 854 655 653 657 646 981 1592 1595
2048, d8s9-1 1354 1472 1204 870 616 619 1350 1764 1762
2048, d9s7+2 1682 1447 1181 1311 615 1037 4169 4122 4158
2048, d9s8+1 1682 1443 1175 1312 618 1033 4060 4030 4052
2048, d9s9-0 1103 835 665 664 664 665 4153 4116 4138
2048, d15s15 765 842 656 655 652 655 4169 4122 4142
-
I still get the same exception.
Here is one without lddqu.
-
Here is one without lddqu.
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE2)
Algo         memcpy  MemCo1    MemCoL   xmemcpy  memcpy_S
Description  CRT     rep movs  Masm32   Habran   Guga
                     dest-al   library
Code size    ?       70        33       104      88
---------------------------------------------------------
2048, d0s0-0 628 642 642 1321 1331
2048, d1s1-0 1087 715 4164 3140 3187
2048, d7s7-0 867 713 3264 3325 7161
2048, d7s8-1 3359 3530 5099 4081 4124
2048, d7s9-2 3357 3528 7352 7144 7257
2048, d8s7+1 3180 3382 3188 4296 4293
2048, d8s8-0 1978 1412 1972 2137 2164
2048, d8s9-1 3310 3499 3103 4276 4257
2048, d9s7+2 4970 3384 5744 5617 5620
2048, d9s8+1 4952 3377 5117 3920 3935
2048, d9s9-0 2932 1368 6637 6043 6071
2048, d15s15 1295 1454 6638 6023 6079
--- ok ---
-
Hi JJ. Many thanks. The source is perfectly translated :)
I'll try JWasm.
My results are:
Intel(R) Core(TM) i7 CPU 870 @ 2.93GHz (SSE4)
Algo         memcpy  MemCo1    MemCo2  MemCoC3   MemCoP4  MemCoC2  MemCoL   xmemcpy  memcpy_S
Description  CRT     rep movs  movdqa  lps+hps   movdqa   movdqa   Masm32   Habran   Guga
                     dest-al   psllq   CeleronM  dest-al  src-al   library
Code size    ?       70        291     222       200      269      33       104      88
---------------------------------------------------------------------------------------------
2048, d0s0-0 161 160 175 174 172 166 158 342 329
2048, d1s1-0 212 186 212 214 211 205 217 342 329
2048, d7s7-0 219 189 213 214 213 206 217 341 330
2048, d7s8-1 219 220 595 424 162 174 216 340 330
2048, d7s9-2 218 220 571 425 162 174 216 342 329
2048, d8s7+1 214 221 502 450 162 166 217 342 326
2048, d8s8-0 215 190 214 215 215 208 217 342 329
2048, d8s9-1 214 221 500 425 162 174 216 341 326
2048, d9s7+2 219 220 501 452 162 166 217 342 332
2048, d9s8+1 212 220 498 450 162 167 216 340 329
2048, d9s9-0 212 189 214 216 215 208 216 340 331
2048, d15s15 160 190 216 217 216 212 216 340 331
From what I saw, my results and Gunther's are pretty stable (fewer variations), probably due to the i7 processor, judging by the others' results (yours, dedndave's and Michael's).
But there is one thing I failed to understand (actually 2 things).
1st) Why are my results higher than the CRT's? I presume my algo is way too slow then :(
E.g.: CRT = 160, mine = 331?
On dedndave's P4 Prescott, mine is 4 times slower than M$?
2048, d7s7-0 1005 850 648 647 650 655 4416 3935 4079
Even MemCoL from masm32 is 4 times slower than M$?
Am I reading the results correctly?
2nd) Err... well... I forgot what it was (a bit tired).
It shouldn't be important, or I'll remember what I was not understanding in the results. :)
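The "4 times slower" reading can be checked by dividing two figures from the same table row. A minimal sketch, assuming that the table values are timings where lower is better; `slowdown` is a hypothetical helper name.

```c
#include <assert.h>

/* Ratio of two timings taken from the same table row; > 1 means slower. */
double slowdown(unsigned algo_time, unsigned reference_time)
{
    assert(reference_time != 0);
    return (double)algo_time / (double)reference_time;
}
```

With the figures quoted above, `slowdown(331, 160)` is a little over 2 on the i7, and `slowdown(4079, 1005)` is a little over 4 on the P4 Prescott, so the reading of the d7s7-0 row is correct.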
-
1st) Why are my results higher than the CRT's?
The results depend strongly on the CPU. Some of the names in the columns (P4, Celeron) already show for which CPU the algos were optimised. Below is a big table showing some results.
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
Algo         memcpy  MemCo1    MemCo2  MemCoC3   MemCoP4  MemCoC2  MemCoL
Description  CRT     rep movs  movdqa  lps+hps   movdqa   movdqa   Masm32
                     dest-al   psllq   CeleronM  dest-al  src-al   library
Code size    ?       70        291     222       200      269      33
---------------------------------------------------------------------------
2048, d0s0-0 733 735 608 608 615 872 732
2048, d1s1-0 1100 821 649 649 643 653 4299
2048, d7s7-0 995 827 654 661 649 654 4324
2048, d7s8-1 1262 1339 1207 870 618 621 4319
2048, d7s9-2 1262 1341 1218 872 619 611 4340
2048, d8s7+1 1244 1333 1188 1213 620 916 1229
2048, d8s8-0 980 819 656 655 659 655 984
2048, d8s9-1 1228 1347 1210 870 613 621 1229
2048, d9s7+2 1584 1334 1176 1208 613 932 4029
2048, d9s8+1 1587 1333 1176 1209 618 929 4020
2048, d9s9-0 1101 821 660 659 659 661 4040
2048, d15s15 766 825 654 661 661 651 4031
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
Algo         memcpy  MemCo1    MemCo2  MemCoC3   MemCoP4  MemCoC2  MemCoL
Description  CRT     rep movs  movdqa  lps+hps   movdqa   movdqa   Masm32
                     dest-al   psllq   src+dest  dest-al  src-al   library
Code size    ?       70        291     222       200      269      33
---------------------------------------------------------------------------
2048, d0s0-0 556 566 363 363 373 363 560
2048, d1s1-0 1047 619 418 420 444 420 1723
2048, d7s7-0 567 619 419 421 446 421 1744
2048, d7s8-1 1474 1515 1090 441 962 965 1535
2048, d7s9-2 1473 1522 1090 448 970 975 1759
2048, d8s7+1 1464 1309 1090 698 817 822 1465
2048, d8s8-0 556 619 421 423 448 423 560
2048, d8s9-1 1465 1522 1083 441 961 962 1467
2048, d9s7+2 1481 1309 1081 765 824 832 1804
2048, d9s8+1 1481 1309 1081 696 818 821 1510
2048, d9s9-0 1047 619 421 423 448 423 1724
2048, d15s15 567 619 423 425 446 424 1718
Intel(R) Celeron(R) CPU 2.40GHz (SSE3 - jj Desktop)
Algo         memcpy  MemCo1    MemCo2  MemCoC3   MemCoP4  MemCoC2  MemCoL
Description  CRT     rep movs  movdqa  lps+hps   movdqa   movdqa   Masm32
                     dest-al   psllq   src+dest  dest-al  src-al   library
Code size    ?       70        291     222       200      269      33
---------------------------------------------------------------------------
2048, d0s0-0 744 746 605 609 602 605 746
2048, d1s1-0 1098 827 657 657 647 653 4058
2048, d7s7-0 1004 824 662 658 658 662 4301
2048, d7s8-1 1240 1322 1222 1185 701 702 4285
2048, d7s9-2 1243 1336 1222 1221 697 694 4050
2048, d8s7+1 1214 1316 1190 1219 606 917 1216
2048, d8s8-0 980 830 663 667 666 667 981
2048, d8s9-1 1210 1334 1209 893 609 610 1212
2048, d9s7+2 1587 1316 1178 1282 694 989 4305
2048, d9s8+1 1586 1316 1178 1308 702 968 4304
2048, d9s9-0 1098 831 660 663 661 660 4305
2048, d15s15 753 831 675 675 674 675 4340
Intel(R) Core(TM)2 Duo CPU E8400 @ 3.00GHz (SSE4, BlackVortex (http://www.masm32.com/board/index.php?topic=11454.msg87622#msg87622))
Algo         _imp__st  MemCo1    MemCo2  MemCoC3   MemCoP4  MemCoC2  MemCoL
Description  CRT       rep movs  movdqa  lps+hps   movdqa   movdqa   Masm32
             strcpy    dest-al   psllq   src+dest  dest-al  src-al   library
Code size    ?         70        291     222       200      269      33
---------------------------------------------------------------------------
2048, d0s0-0 1647 360 210 210 209 165 367
2048, d1s1-0 1649 396 259 266 258 219 1876
2048, d7s7-0 1631 402 265 261 261 218 1876
2048, d7s8-1 2159 1332 861 493 658 670 1439
2048, d7s9-2 2188 1338 862 493 658 666 1906
2048, d8s7+1 2151 1328 855 829 701 700 1393
2048, d8s8-0 1639 402 267 262 268 236 365
2048, d8s9-1 2205 1333 854 493 658 667 1290
2048, d9s7+2 2151 1329 849 828 700 701 1877
2048, d9s8+1 2143 1330 849 830 701 701 1471
2048, d9s9-0 1642 403 266 262 268 235 1884
2048, d15s15 1642 404 270 264 265 235 1893
Intel(R) Core(TM) i7 CPU 920 @ 2.67GHz (SSE4, Ramguru (http://www.masm32.com/board/index.php?topic=11454.msg87601#msg87601))
Algo         _imp__st  MemCo1    MemCo2  MemCoC3   MemCoP4  MemCoC2  MemCoL
Description  CRT       rep movs  movdqa  lps+hps   movdqa   movdqa   Masm32
             strcpy    dest-al   psllq   src+dest  dest-al  src-al   library
Code size    ?         70        291     222       200      269      33
---------------------------------------------------------------------------
2048, d0s0-0 1985 191 204 205 203 175 196
2048, d1s1-0 1992 223 251 248 251 222 244
2048, d7s7-0 1982 227 253 250 248 223 243
2048, d7s8-1 1976 246 592 501 193 211 243
2048, d7s9-2 1986 247 594 501 194 212 243
2048, d8s7+1 1982 249 587 501 193 179 244
2048, d8s8-0 1976 229 255 253 251 226 243
2048, d8s9-1 1983 249 588 502 194 211 243
2048, d9s7+2 1982 248 583 501 193 179 243
2048, d9s8+1 1977 249 583 501 193 179 243
2048, d9s9-0 1984 229 255 253 251 226 244
2048, d15s15 1974 228 256 256 258 226 243
AMD Athlon(tm) 64 X2 Dual Core Processor 4000+ (SSE3, Mark Jones (http://www.masm32.com/board/index.php?topic=11454.msg87609#msg87609))
Algo         _imp__st  MemCo1    MemCo2  MemCoC3   MemCoP4  MemCoC2  MemCoL
Description  CRT       rep movs  movdqa  lps+hps   movdqa   movdqa   Masm32
             strcpy    dest-al   psllq   src+dest  dest-al  src-al   library
Code size    ?         70        291     222       200      269      33
---------------------------------------------------------------------------
2048, d0s0-0 2079 551 359 424 424 359 547
2048, d1s1-0 2084 613 410 473 473 410 1060
2048, d7s7-0 2080 598 412 474 474 411 1059
2048, d7s8-1 2156 853 1016 564 567 569 802
2048, d7s9-2 2162 859 1016 564 567 568 1058
2048, d8s7+1 2172 849 868 564 565 566 804
2048, d8s8-0 2082 603 404 465 465 402 547
2048, d8s9-1 2177 848 995 564 565 568 803
2048, d9s7+2 2156 855 862 581 565 580 1060
2048, d9s8+1 2167 869 878 564 567 566 803
2048, d9s9-0 2090 595 412 472 472 409 1060
2048, d15s15 2064 592 410 470 486 408 1060
Intel(R) Pentium(R) 4 CPU 2.40GHz (SSE2: lddqu not possible, used movdqu)
Algo         memcpy  MemCo1    MemCo2  MemCoC3   MemCoP4  MemCoC2  MemCoL
Description  CRT     rep movs  movdqa  lps+hps   movdqa   movdqa   Masm32
                     dest-al   psllq   src+dest  dest-al  src-al   library
Code size    ?       70        291     222       200      269      33
---------------------------------------------------------------------------
2048, d0s0-0 660 665 506 505 503 506 667
2048, d1s1-0 1088 712 554 548 550 555 3291
2048, d7s7-0 896 712 572 567 552 576 3288
2048, d7s8-1 1492 1630 1385 1134 1411 1396 3254
2048, d7s9-2 1492 1694 1385 1239 1415 1432 3992
2048, d8s7+1 1876 1565 1391 2085 1335 1377 1890
2048, d8s8-0 895 714 562 559 562 566 902
2048, d8s9-1 1516 1693 1374 1167 1383 1364 1532
2048, d9s7+2 2298 1567 1365 2220 1341 1323 3300
2048, d9s8+1 2298 1564 1363 2169 1382 1319 3261
2048, d9s9-0 1090 712 564 560 559 559 3327
2048, d15s15 683 709 565 575 578 564 3337
-
OK, I understand it better now.
So, from what I saw, using the SSE mnemonics as I did needs to go together with a strategic (code) optimization. Analysing memcpy from msvcrt, I saw that it checks the alignment of the addresses and jumps here and there, looking for the best-aligned addresses that can be copied.
Although the code is kind of big, this strategy is interesting, especially considering that the mnemonics I used have a latency of 8 cycles, while the ones from M$ have 2 or 4. Although my code seems to be fairly stable across several CPUs (the results don't vary that much between aligned and unaligned data), better results are possible. For instance, if on an i7 my results and the ones from msvcrt are similar while I'm using mnemonics that take 4 times more cycles to execute, then it should be possible to achieve a better result that suits all CPUs.
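The alignment strategy described here can be sketched in C as follows. This is not the msvcrt code, only a minimal illustration of the general idea under assumptions: `copy_dest_aligned` is a hypothetical name, and the fixed 16-byte `memcpy` again stands in for a 128-bit move. A few head bytes are copied until the destination sits on a 16-byte boundary, then whole blocks, then the tail.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: align the destination first, then copy blocks.
   The buffers must not overlap. */
void *copy_dest_aligned(void *dest, const void *src, size_t len)
{
    unsigned char *d = dest;
    const unsigned char *s = src;
    size_t head = (16u - ((uintptr_t)d & 15u)) & 15u;  /* bytes to next boundary */

    if (head > len)
        head = len;
    len -= head;
    while (head--)                  /* byte copy up to the boundary */
        *d++ = *s++;

    assert(len == 0 || ((uintptr_t)d & 15u) == 0);     /* dest now aligned */
    for (; len >= 16; len -= 16, d += 16, s += 16)
        memcpy(d, s, 16);           /* aligned store, possibly unaligned load */
    while (len--)                   /* leftover tail bytes */
        *d++ = *s++;
    return dest;
}
```

When source and destination share the same offset (the d7s7-0 style rows), the loads become aligned as well; otherwise only the stores are, which is what the "dest-al" variants in the tables refer to.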
-
Well, try it yourself... I have rearranged the columns so that you can see your algo side-by-side with CRT and (to the right) good ol' movs.
AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
Algo         memcpy  memcpy_S  MemCo1    MemCo2  MemCoC3   MemCoP4  MemCoC2  MemCoL   xmemcpy
Description  CRT     Guga      rep movs  movdqa  lps+hps   movdqa   movdqa   Masm32   Habran
                               dest-al   psllq   CeleronM  dest-al  src-al   library
Code size    ?       88        59        291     222       200      269      33       104
---------------------------------------------------------------------------------------------
2048, d0s0-0 546 550 564 359 424 424 359 561 541
2048, d1s1-0 719 800 573 410 482 473 425 1060 797
2048, d7s7-0 723 799 574 412 474 474 411 1059 797
2048, d7s8-1 809 543 835 1017 563 566 567 802 543
2048, d7s9-2 809 799 832 1017 563 566 567 1058 797
2048, d8s7+1 802 544 808 872 563 572 565 804 605
2048, d8s8-0 728 541 550 404 479 465 402 553 541
2048, d7s3+4 561 798 573 901 576 577 573 1058 802
2048, d3s7-4 551 802 575 1040 563 566 567 1060 814
2048, d8s9-1 820 544 809 994 576 575 567 804 606
2048, d9s7+2 808 798 846 862 565 564 571 1066 798
2048, d9s8+1 808 543 846 862 563 566 565 803 543
2048, d9s9-0 721 808 574 411 474 487 409 1061 798
2048, d15s15 723 798 588 410 470 471 408 1059 814