reason to switch to 64 Bit Assembler

jj2007 · February 21, 2013, 01:31:28 PM

Quote from: habran on February 21, 2013, 01:14:23 PM
C is not a plug it is a programming language for Christ sake

It's spelled "plague", Habran.

habran · February 21, 2013, 01:49:47 PM

thanks wise man JJ2007 :t
what kind of spelling checker is that when it did not warn me!!! ;)
I will reward you for that and only you can use it, you deserved it! ;)
here is for you changed source:

xmemcpy PROC uses ebx dest:DWORD,src:DWORD,count:DWORD
    mov ecx,dest
    mov edx,src
    mov ebx,count
    .if (ecx!=edx)              ; skip the copy if source and destination coincide
       shr ebx,1                ; bit 0 of count -> CARRY: copy 1 byte
       .if (CARRY?)
          mov al,[edx]
          mov [ecx],al
          inc ecx
          inc edx
       .endif
       shr ebx,1                ; bit 1: copy 2 bytes
       .if (CARRY?)
          mov ax,[edx]
          mov [ecx],ax
          add ecx,2
          add edx,2
       .endif
       shr ebx,1                ; bit 2: copy 4 bytes
       .if (CARRY?)
          mov eax,[edx]
          mov [ecx],eax
          add ecx,4
          add edx,4
       .endif
       shr ebx,1                ; bit 3: copy 8 bytes
       .if (CARRY?)
          movq xmm4,[edx]
          movq [ecx],xmm4
          add ecx,8
          add edx,8
       .endif
       .while (ebx)             ; ebx = count/16: copy the remaining 16-byte blocks
          movdqu xmm4,[edx]
          movdqu [ecx],xmm4
          add ecx,16
          add edx,16
          dec ebx
       .endw
    .endif
    mov eax,dest                ; return dest, like memcpy
    ret
xmemcpy ENDP
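
For readers who prefer C, the idea behind the routine above can be sketched roughly as follows. This is only an illustration of the technique (the low four bits of the count each select one 1-, 2-, 4- or 8-byte move, and the remaining bits count 16-byte blocks), not habran's code and not a benchmark candidate. The name bitpeel_copy is made up for this sketch, and memcpy stands in for the individual mov/movq/movdqu instructions.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical C rendering of the bit-peeling idea; the name is an
   assumption of this sketch and does not appear in the thread. */
void *bitpeel_copy(void *dest, const void *src, size_t count)
{
    unsigned char *d = dest;
    const unsigned char *s = src;
    size_t chunk;

    if (d == s)
        return dest;                 /* same buffer: nothing to copy */

    /* Bits 0..3 of count each select one move of 1, 2, 4 and 8 bytes. */
    for (chunk = 1; chunk <= 8; chunk <<= 1) {
        if (count & chunk) {
            memcpy(d, s, chunk);
            d += chunk;
            s += chunk;
        }
    }

    /* The remaining bits are the number of 16-byte blocks. */
    for (count >>= 4; count != 0; --count) {
        memcpy(d, s, 16);
        d += 16;
        s += 16;
    }
    return dest;
}
```

Like the assembler version, this copies the odd-sized pieces first, so the 16-byte loop starts at whatever alignment is left over. It never realigns the pointers.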

habran · February 21, 2013, 04:05:16 PM

hey 2007,
are you going to abandon me because of a little spelling mistake :icon_eek:
plug, plague, plug in,plug out, ear plug, plagiarism... who cares

you are just trying to mask the main issue: compiling JWasm

those ENGLEZE have made mess with unnecessary complex spelling just to tease pure strangers :exclaim:
they messed it up so much that even they can not write "for sale" but use "4 sale" :icon_confused:

japheth · February 21, 2013, 06:43:56 PM

Hello,

Quote from: habran on February 21, 2013, 12:08:50 PM
if you don't have M$VC you can compile it with PellesC

It's mentioned in JWasm's readme, but since nobody reads readmes, I'll repeat it here: better do NOT use PellesC to compile JWasm. The jwasm binary created by PellesC is unable to pass the regression tests supplied with the assembler. I haven't analyzed the problem too deeply, but judging from the part that fails, I assume that floating-point constants don't get the values they should.

Good compilers are: Open Watcom, MSVC, GCC (MinGW)

jj2007 · February 21, 2013, 09:12:31 PM

Quote from: habran on February 21, 2013, 01:49:47 PM
thanks wise man JJ2007 :t
...
here is for you changed source:
xmemcpy PROC uses ebx dest:DWORD,src:DWORD,count:DWORD
mov ecx,dest
mov edx,src
...
ret
xmemcpy ENDP

Thanks, it looks competitive :t

AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
loop overhead is approx. 238/100 cycles

9458 cycles for 100 * xmemcpy
8056 cycles for 100 * MbCopy

9292 cycles for 100 * xmemcpy
7893 cycles for 100 * MbCopy

9289 cycles for 100 * xmemcpy
8072 cycles for 100 * MbCopy

habran · February 21, 2013, 10:43:26 PM

Hi Japheth,

Quote
do NOT use PellesC to compile JWasm

sorry for misunderstanding

I've read it but I thought that it applies only to 64 bit

jj2007,
thanks for testing it
this version is created for unaligned data as I mentioned before
can you please try to compare when not aligned at all?

habran · February 21, 2013, 10:46:27 PM

jj2007,
this is what my machine produces from your test:

Intel(R) Core(TM) i7-3610QM CPU @ 2.30GHz (SSE4)
loop overhead is approx. 155/100 cycles

2242    cycles for 100 * xmemcpy
5356    cycles for 100 * MbCopy

2239    cycles for 100 * xmemcpy
5455    cycles for 100 * MbCopy

2243    cycles for 100 * xmemcpy
5166    cycles for 100 * MbCopy


--- ok ---

more than twice as fast as yours, wouldn't you say so :shock:

Quote
Thanks, it looks competitive

I would say It looks downright stunning!!!! :t

japheth · February 22, 2013, 12:14:38 AM

Quote from: habran on February 21, 2013, 10:46:27 PM
I would say It looks downright stunning!!!! :t

I fully agree! However - almost 100% faster than MB - which allegedly is already rocket-science? How is this possible? You must do something wrong...

habran · February 22, 2013, 12:15:53 AM

here is 64 bit without .for:

option win64:0
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
xmemcpy PROC dest:QWORD,src:QWORD,count:UINT_PTR
    mov rax,rcx                 ; return dest (Win64: rcx=dest, rdx=src, r8=count)
    .if (rcx!=rdx)
        shr r8,1                ; bit 0 of count: copy 1 byte
        .if (CARRY?)
            mov r9b,[rdx]
            mov [rcx],r9b
            inc rcx
            inc rdx
        .endif
        shr r8,1                ; bit 1: copy 2 bytes
        .if (CARRY?)
            mov r9w,[rdx]
            mov [rcx],r9w
            add rcx,2
            add rdx,2
        .endif
        shr r8,1                ; bit 2: copy 4 bytes
        .if (CARRY?)
            mov r9d,[rdx]
            mov [rcx],r9d
            add rcx,4
            add rdx,4
        .endif
        shr r8,1                ; bit 3: copy 8 bytes
        .if (CARRY?)
            mov r9,[rdx]
            mov [rcx],r9
            add rcx,8
            add rdx,8
        .endif
        shr r8,1                ; bit 4: copy 16 bytes
        .if (CARRY?)
            movdqu xmm4,[rdx]
            movdqu [rcx],xmm4
            add rcx,16
            add rdx,16
        .endif
        .while (r8)             ; r8 = count/32: copy 32-byte blocks (needs AVX)
            vmovdqu ymm4,[rdx]
            vmovdqu [rcx],ymm4
            add rcx,32
            add rdx,32
            dec r8
        .endw
    .endif
    ret
xmemcpy ENDP
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

habran · February 22, 2013, 12:22:58 AM

Japheth,

Quote
I fully agree! However - almost 100% faster than MB - which allegedly is already rocket-science? How is this possible? You must do something wrong...

I did not touch the code, I just ran JJ's exe on my machine
and I can do it again now, let's see:

Intel(R) Core(TM) i7-3610QM CPU @ 2.30GHz (SSE4)
loop overhead is approx. 154/100 cycles

2237    cycles for 100 * xmemcpy
5243    cycles for 100 * MbCopy

2240    cycles for 100 * xmemcpy
5176    cycles for 100 * MbCopy

2233    cycles for 100 * xmemcpy
5163    cycles for 100 * MbCopy


--- ok ---

Japheth, why don't you try it in your machine?

japheth · February 22, 2013, 12:35:02 AM

Quote from: habran on February 22, 2013, 12:22:58 AM
Japheth, why don't you try it in your machine?

The fastest machine that I have available is a 5-year-old AMD 64 X2 5000+.

jj2007 · February 22, 2013, 01:00:54 AM

Cool down. If it's faster than the MasmBasic algo, it just means it is faster on your CPU. Well optimised for your CPU.

In case you'd like a less superficial comparison (d7 = destination is aligned to 16+7, s3 = source is at 16+3, etc.):

AMD Athlon(tm) Dual Core Processor 4450B (SSE3)

Algo           memcpy   MemCo1   MemCo2  MemCoC3  MemCoP4  MemCoC2   MemCoL  xmemcpy
Description       CRT rep movs   movdqa  lps+hps   movdqa   movdqa   Masm32 Habran's
                       dest-al    psllq CeleronM  dest-al   src-al  library  Ferrari
Code size           ?       70      291      222      200      269       33      104
------------------------------------------------------------------------------------
2048, d0s0-0      561      549      360      439      424      361      547      541
2048, d1s1-0      720      597      410      473      473      421     1061      798
2048, d7s7-0      721      598      412      474      474      412     1060      798
2048, d7s8-1      809      851     1016      578      566      582      802      558
2048, d7s9-2      809      853     1016      567      566      567     1058      798
2048, d8s7+1      810      851      868      563      564      565      819      607
2048, d8s8-0      738      587      404      465      480      416      547      541
2048, d8s9-1      801      848      994      563      564      567      804      606
2048, d9s7+2      824      864      862      565      564      579     1060      798
2048, d9s8+1      808      853      862      564      567      565      803      543
2048, d9s9-0      721      595      411      472      472      409     1061      798
2048, d15s15      722      591      425      480      486      422     1072      798

Your algo is pretty good, but for the (frequent) aligned case, there are four algos that perform better on my AMD.
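
For anyone who wants to reproduce misaligned cases like the dNsM rows above, here is a minimal sketch in C of how such pointers could be set up. It is not the harness that produced these tables: the names TEST_BYTES, src_pool, dst_pool, at_offset and misalignment are invented for the illustration. The signed suffix in the row labels (-1, +2, ...) looks like the destination offset minus the source offset, but that is only a reading of the labels.

```c
#include <stddef.h>
#include <stdint.h>

#define TEST_BYTES 2048   /* assumed to be the "2048" in the row labels */

/* 16-byte-aligned backing stores with room for an offset of up to 15.
   All names here are hypothetical, chosen for this sketch only. */
_Alignas(16) unsigned char src_pool[TEST_BYTES + 16];
_Alignas(16) unsigned char dst_pool[TEST_BYTES + 16];

/* Return a pointer that sits `offset` bytes past a 16-byte boundary. */
unsigned char *at_offset(unsigned char *pool, unsigned offset)
{
    return pool + (offset & 15u);
}

/* Distance of a pointer from the previous 16-byte boundary (0..15). */
unsigned misalignment(const void *p)
{
    return (unsigned)((uintptr_t)p & 15u);
}
```

With these helpers, a "d7s8" case would copy TEST_BYTES bytes from at_offset(src_pool, 8) to at_offset(dst_pool, 7).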

dedndave · February 22, 2013, 02:32:43 AM

prescott w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)

Algo           memcpy   MemCo1   MemCo2  MemCoC3  MemCoP4  MemCoC2   MemCoL  xmemcpy
Description       CRT rep movs   movdqa  lps+hps   movdqa   movdqa   Masm32 Habran's
                       dest-al    psllq CeleronM  dest-al   src-al  library  Ferrari
Code size           ?       70      291      222      200      269       33      104
------------------------------------------------------------------------------------
2048, d0s0-0      717      719      608      609      904      610      718     1590
2048, d1s1-0     1100      846      651      651      650      650     4435     3945
2048, d7s7-0     1003      849      656      657      656      655     4437     3952
2048, d7s8-1     1368     1445     1223      868      611      613     4303     3799
2048, d7s9-2     1367     1446     1224      867      611      611     4454     3929
2048, d8s7+1     1338     1446     1188     1342      611     1023     1343     1748
2048, d8s8-0      976      849      656      657      657      656      977     1588
2048, d8s9-1     1332     1470     1212      873      611      612     1333     1733
2048, d9s7+2     1663     1440     1179     1342      611     1023     4150     4085
2048, d9s8+1     1660     1439     1182     1343      610     1023     4026     4014
2048, d9s9-0     1098      850      664      667      664      664     4135     4127
2048, d15s15      770      853      664      665      662      664     4136     4108

Gunther · February 22, 2013, 04:17:44 AM

Here the test results:

Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)

Algo           memcpy   MemCo1   MemCo2  MemCoC3  MemCoP4  MemCoC2   MemCoL  xmemcpy
Description       CRT rep movs   movdqa  lps+hps   movdqa   movdqa   Masm32 Habran's
                       dest-al    psllq CeleronM  dest-al   src-al  library  Ferrari
Code size           ?       70      291      222      200      269       33      104
------------------------------------------------------------------------------------
2048, d0s0-0      427      223      251      248      247      250      224      292
2048, d1s1-0      275      251      275      270      277      273      274      303
2048, d7s7-0      275      253      282      273      278      276      274      303
2048, d7s8-1      279      271      617      453      247      269      273      303
2048, d7s9-2      279      272      617      450      254      269      274      303
2048, d8s7+1      275      270      621      483      256      272      274      304
2048, d8s8-0      275      255      295      284      288      291      274      303
2048, d8s9-1      275      271      610      452      254      269      273      294
2048, d9s7+2      283      272      611      486      262      276      276      309
2048, d9s8+1      287      277      612      486      261      276      274      309
2048, d9s9-0      280      260      287      280      281      285      280      309
2048, d15s15      280      260      287      281      282      286      280      309

Gunther

jj2007 · February 22, 2013, 05:30:19 AM

One more - not by accident, #4 was named "CeleronM" ;-)

Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)

Algo           memcpy   MemCo1   MemCo2  MemCoC3  MemCoP4  MemCoC2   MemCoL  xmemcpy
Description       CRT rep movs   movdqa  lps+hps   movdqa   movdqa   Masm32 Habran's
                       dest-al    psllq CeleronM  dest-al   src-al  library  Ferrari
Code size           ?       70      291      222      200      269       33      104
------------------------------------------------------------------------------------
2048, d0s0-0      556      566      363      363      373      363      563     1051
2048, d1s1-0     1047      619      421      423      444      423     1683     1782
2048, d7s7-0      567      619      418      420      446      420     1699     1782
2048, d7s8-1     1677     1714     1090      441     1118     1123     1302     1337
2048, d7s9-2     1677     1713     1090      441     1118     1123     1716     1782
2048, d8s7+1     1655     1502     1090      857      979      975     1647     1245
2048, d8s8-0      556      619      420      422      448      422      563     1051
2048, d8s9-1     1664     1714     1083      441     1118     1123     1661     1241
2048, d9s7+2     1668     1502     1081      857      979      975     1762     1495
2048, d9s8+1     1668     1502     1081      857      979      975     1283     1052
2048, d9s9-0     1047      619      420      422      448      422     1686     1497
2048, d15s15      567      619      422      424      446      424     1678     1497
