### Author Topic: reason to switch to 64 Bit Assembler  (Read 49756 times)

#### habran

• Member
• Posts: 1225
##### Re: reason to switch to 64 Bit Assembler
« Reply #15 on: February 11, 2013, 05:49:30 AM »
thanks qWord
Cod-Father

#### qWord

• Member
• Posts: 1473
• The base type of a type is the type itself
##### Re: reason to switch to 64 Bit Assembler
« Reply #16 on: February 11, 2013, 06:23:36 AM »
habran,
your comparison between C and ASM in your first post is unfair, because it is obviously a debug build.
Also, a smarter "algorithm" would probably produce much better results. e.g. something like this (not tested):
Code:
```c
void* xmemcpy(void *dest, void *src, unsigned int cb)
{
    unsigned int cnt1 = cb>>((sizeof(char*)==8)?3:2);
    unsigned int cnt2 = cb&(sizeof(char*)-1);
    char** p1 = (char**)dest;
    char** p2 = (char**)src;
    char* p3;
    char* p4;

    for(;cnt1--;p1++,p2++)
        *p1 = *p2;

    p3 = (char*)p1;
    p4 = (char*)p2;

    if (sizeof(char*) == 8) // dead code for x32
        if(cnt2&4)
        {
            *((int*)p3)= *((int*)p4);
            p3+=4;p4+=4;cnt2-=4;
        }

    for(;cnt2--;p3++,p4++)
        *p3 = *p4;

    return dest;
}
```
« Last Edit: February 11, 2013, 10:15:23 AM by qWord »
MREAL macros - when you need floating point arithmetic while assembling!

#### jj2007

• Member
• Posts: 10543
• Assembler is fun ;-)
##### Re: reason to switch to 64 Bit Assembler
« Reply #17 on: February 11, 2013, 09:07:53 AM »
habran,
your comparison between C and ASM in your first post is unfair, because it is obviously a debug build.
Also, a smarter "algorithm" would probably produce much better results.

I wonder how efficient this code from the "64 beauty" example is (can't test it, unfortunately):

000000014004DA90  dec         r8
000000014004DA93  and         r8,r8 <<< no need for that, the flag is already set
000000014004DA96  je          xmemcpy+22h (14004DA9Ah) <<< why not jne xmemcpy+0Eh?? static branch prediction rules would suggest that it is even faster...
000000014004DA98  jmp         xmemcpy+0Eh (14004DA86h) <<< can be dropped entirely.
000000014004DA9A  mov         rax,r9

Again, timings would be nice ;-)
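
The remarks on the loop tail can also be pictured in C. The following is only a sketch under assumptions, not habran's code: `bytecopy_bottom_tested` is a hypothetical name, and whether a particular compiler turns the loop into one decrement followed by one backward conditional jump depends on the compiler and its optimization settings.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the thread): byte-wise copy with a
   bottom-tested loop, so the counter is decremented and tested in one place. */
void *bytecopy_bottom_tested(void *dest, const void *src, size_t n)
{
    unsigned char *d = (unsigned char *)dest;
    const unsigned char *s = (const unsigned char *)src;

    assert(n == 0 || (dest != NULL && src != NULL));

    if (n != 0) {               /* guard once, outside the loop */
        do {
            *d++ = *s++;
        } while (--n != 0);     /* the decrement itself decides the branch */
    }
    return dest;
}
```

The design choice here is that the test sits at the bottom of the loop, so no separate compare and no unconditional jump back to the top are needed in the source.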

#### frktons

• Member
• Posts: 491
##### Re: reason to switch to 64 Bit Assembler
« Reply #18 on: February 11, 2013, 10:35:01 AM »
From what I recall, a memcopy done with native 64-bit registers on
64-bit systems was the fastest solution we found when we tested it
against XMM/SSE2 code for this kind of operation a couple of years ago.

The test was done on a 32 MB buffer that was simply blanked, so it was not
really a memcopy; it was set up just to measure the performance of
REP STOSQ vs MOVNTDQ, with the timing taken via rdtsc.
The results were like:
Quote
Clearing done
117,940,861 clocks for a 33,554,432 bytes buffer with using REP STOSQ

Clearing done
1,208,750,068 clocks for a 33,554,432 bytes buffer with using MOVNTDQ

Code from Alex.
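
For readers who want to repeat a test of this kind, here is a rough C sketch. It is not Alex's code, which is not shown in the thread. The function names `clear_with_qwords` and `clear_with_movntdq` are hypothetical, the plain 64-bit store loop is only a stand-in for REP STOSQ (a compiler may turn it into something else), and it assumes an x86-64 compiler with SSE2. `_mm_stream_si128` is the intrinsic that maps to MOVNTDQ, and `__rdtsc` reads the time-stamp counter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>      /* SSE2: _mm_setzero_si128, _mm_stream_si128, _mm_sfence */
#if defined(_MSC_VER)
#include <intrin.h>         /* __rdtsc on MSVC */
#else
#include <x86intrin.h>      /* __rdtsc on GCC/Clang */
#endif

/* Hypothetical: clear a buffer with plain 64-bit stores and return the
   elapsed TSC ticks. buf must be 8-byte aligned, bytes a multiple of 8. */
uint64_t clear_with_qwords(void *buf, size_t bytes)
{
    uint64_t *p = (uint64_t *)buf;
    size_t n = bytes / 8;
    uint64_t t0;

    assert(((uintptr_t)buf % 8) == 0 && bytes % 8 == 0);
    t0 = __rdtsc();
    for (size_t i = 0; i < n; i++)
        p[i] = 0;
    return __rdtsc() - t0;
}

/* Hypothetical: clear a buffer with non-temporal 128-bit stores (MOVNTDQ)
   and return the elapsed TSC ticks. buf must be 16-byte aligned, bytes a
   multiple of 16. */
uint64_t clear_with_movntdq(void *buf, size_t bytes)
{
    __m128i *p = (__m128i *)buf;
    const __m128i zero = _mm_setzero_si128();
    size_t n = bytes / 16;
    uint64_t t0;

    assert(((uintptr_t)buf % 16) == 0 && bytes % 16 == 0);
    t0 = __rdtsc();
    for (size_t i = 0; i < n; i++)
        _mm_stream_si128(p + i, zero);
    _mm_sfence();               /* make the streaming stores globally visible */
    return __rdtsc() - t0;
}
```

The tick counts from such a sketch are only indicative: rdtsc is not a serializing instruction, and a single pass over a buffer is sensitive to cache state and to whatever else the machine is doing.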

I agree with habran, as I said at the time, for many reasons, but I also
understand why years of work are not easily dropped or rewritten.  :t

Frank

#### Gunther

• Member
• Posts: 3585
• Forgive your enemies, but never forget their names
##### Re: reason to switch to 64 Bit Assembler
« Reply #19 on: February 11, 2013, 10:51:18 AM »
Frank,

From what I recall, a memcopy done with native 64-bit registers on
64-bit systems was the fastest solution we found when we tested it
against XMM/SSE2 code for this kind of operation a couple of years ago.

the situation has changed dramatically since the advent of Intel's AVX. We should do the test again.

Gunther
Get your facts first, and then you can distort them.

#### frktons

• Member
• Posts: 491
##### Re: reason to switch to 64 Bit Assembler
« Reply #20 on: February 11, 2013, 11:05:20 AM »

the situation has changed dramatically since the advent of Intel's AVX. We should do the test again.

Gunther

I think a new test can only confirm that 64-bit mov operations are
faster than 32-bit ones. If anyone has a new processor, say habran, and
the skill to write AVX code, he could do it.
It is not that difficult if he really wants to do the test; I can post the 64-bit MASM
code that I used 2 years ago. It has no AVX because neither Alex's PC nor mine is
AVX-capable.

#### habran

• Member
• Posts: 1225
##### Re: reason to switch to 64 Bit Assembler
« Reply #21 on: February 11, 2013, 11:35:59 AM »
I have tested the speed of the 64-bit code and the result is 1:4 against C
for one pass JWASM is 80, or 50h
and C is 207, or 0CFh. Interesting :P (JJ2007)
JJ2007, you are correct that more optimization could be done to it
however, my intent in this case was focused not so much on that as on the beauty and simplicity of 64-bit JWASM
I have used a ".for" loop, which is portable, readable and easy to use, but it cannot beat human eyes and brains

thank you Frank for supporting me that's what friends are for :t

qWord,
Quote
your comparison between C and ASM in your first post is unfair, because it is obviously a debug build
all of them are debug builds because I needed to read the code in memory :icon_eek:
your function looks good and I will test it later
Quote
however, we can write a more complex one, e.g. xxxmemcpy, which would be able to calculate the size of the data
and then first transfer all possible QWORDs, and then, if anything is left, a last DWORD, then a last WORD, and then a last BYTE
E.g. a data size of 256+7 equals 32 QWORDs, 1 DWORD, 1 WORD and 1 BYTE
I think I have already seen a similar function on the internet, but I can't remember whether it was in C or assembler
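
The QWORD, DWORD, WORD, BYTE scheme described above could look roughly like this in C. This is a minimal sketch under assumptions, not the function habran remembers seeing: `tiered_memcpy` is a hypothetical name, and the fixed-size `memcpy` calls are used as a portable way to express unaligned 8-, 4- and 2-byte moves, which compilers commonly reduce to single load/store instructions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: copy as many QWORDs as possible, then at most one
   DWORD, one WORD and one BYTE, selected by the low three bits of cb. */
void *tiered_memcpy(void *dest, const void *src, size_t cb)
{
    unsigned char *d = (unsigned char *)dest;
    const unsigned char *s = (const unsigned char *)src;
    size_t qwords = cb >> 3;            /* e.g. 263 >> 3 == 32 */

    assert(cb == 0 || (dest != NULL && src != NULL));

    for (; qwords != 0; --qwords, d += 8, s += 8) {
        uint64_t q;
        memcpy(&q, s, sizeof q);        /* 8-byte move */
        memcpy(d, &q, sizeof q);
    }
    if (cb & 4) {                       /* one DWORD left? */
        uint32_t v;
        memcpy(&v, s, sizeof v);
        memcpy(d, &v, sizeof v);
        d += 4; s += 4;
    }
    if (cb & 2) {                       /* one WORD left? */
        uint16_t w;
        memcpy(&w, s, sizeof w);
        memcpy(d, &w, sizeof w);
        d += 2; s += 2;
    }
    if (cb & 1)                         /* one BYTE left? */
        *d = *s;
    return dest;
}
```

For the 256+7 example, `cb >> 3` gives the 32 QWORDs and the remaining 7 has bits 4, 2 and 1 set, so each of the three tail branches runs once.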
UNFAIR :icon_eek:
what is fair in this world??? life is a bitch!
these days even death is not fair any more, if you are rich you buy yourself brand-new organs and live as long as you want :P

#### habran

• Member
• Posts: 1225
##### Re: reason to switch to 64 Bit Assembler
« Reply #22 on: February 11, 2013, 11:43:07 AM »
hi Frank,
I can try to do that, though I have not learned AVX yet
I am ready for another challenge, I am not a chicken :lol:
I have to go now to earn my living, "I'll be back 8)"

#### jj2007

• Member
• Posts: 10543
• Assembler is fun ;-)
##### Re: reason to switch to 64 Bit Assembler
« Reply #23 on: February 11, 2013, 11:48:15 AM »
Interesting :P (JJ2007)

What do you mean with that? ::)

#### qWord

• Member
• Posts: 1473
• The base type of a type is the type itself
##### Re: reason to switch to 64 Bit Assembler
« Reply #24 on: February 11, 2013, 12:53:56 PM »
I have tested the speed of the 64-bit code and the result is 1:4 against C
I can't confirm that: my own quick test shows that there is nearly no difference between your .for-loop and xmemcpy.
Code:
```
Function:   xmemcpy    xmemcpy2   xmemcpy_Q  xmemcpy_Q2 memcpy     @ForLoop

 --- buffer size = 13 ---
align +0    29         17         4          2          8          33
align +1    29         17         5          3          10         32
align +2    29         17         7          3          13         32
align +3    29         17         5          2          11         33
align +4    29         17         4          2          11         32
align +5    29         17         5          3          11         32
align +6    29         17         4          3          11         32
align +7    29         17         5          2          11         32
align +8    29         17         4          2          11         32
align +9    29         17         5          3          11         32
align +10   29         18         4          3          11         32
align +11   29         17         5          2          11         32
align +12   29         17         4          2          11         32
align +13   29         17         5          3          11         32
align +14   29         17         4          4          11         32
align +15   29         17         5          3          11         32

 --- buffer size = 33 ---
align +0    93         46         10         10         10         77
align +1    73         46         10         7          10         76
align +2    77         46         10         7          10         76
align +3    92         46         10         7          10         77
align +4    73         46         10         7          10         76
align +5    82         46         10         7          10         76
align +6    73         46         10         7          10         77
align +7    91         47         10         7          10         76
align +8    73         47         11         7          13         76
align +9    73         47         10         7          10         91
align +10   73         46         10         7          10         76
align +11   73         47         10         7          11         77
align +12   76         47         10         7          10         76
align +13   86         47         10         7          10         76
align +14   74         47         10         7          10         76
align +15   84         47         18         8          10         76

 --- buffer size = 59 ---
align +0    124        97         21         14         17         134
align +1    152        98         22         15         17         135
align +2    129        98         23         15         17         134
align +3    129        102        22         15         17         135
align +4    129        98         22         14         17         137
align +5    129        98         21         16         18         139
align +6    128        98         21         16         17         133
align +7    129        98         21         15         17         135
align +8    129        98         21         14         17         134
align +9    128        98         21         14         17         135
align +10   128        98         21         15         17         135
align +11   129        98         20         15         17         134
align +12   129        98         21         14         17         134
align +13   133        98         21         15         17         135
align +14   135        99         21         15         16         135
align +15   127        98         21         15         17         134

 --- buffer size = 590 ---
align +0    920        908        150        123        65         1041
align +1    915        886        149        124        62         1040
align +2    922        906        149        124        82         1048
align +3    925        887        150        124        64         1037
align +4    920        891        150        127        63         1103
align +5    918        892        149        124        64         1042
align +6    974        897        157        128        82         1087
align +7    938        888        149        124        64         1032
align +8    921        889        151        123        65         1070
align +9    941        887        154        124        85         1051
align +10   937        888        150        124        63         1056
align +11   953        887        150        123        64         1013
align +12   920        897        151        124        63         1039
align +13   925        900        150        123        63         1017
align +14   938        892        151        124        63         1090
align +15   927        889        156        125        64         1053

 ---   Functions ----
  xmemcpy    : habran , PellesC
  xmemcpy2   : habran , VC 2012
  xmemcpy_Q  : qWord  , PellesC
  xmemcpy_Q2 : qWord  , VC 2012
  memcpy     : MSVCRT
  @ForLoop   : habran

only alignment of Src varies, Dest is allocated by HeapAlloc()
Press any key to continue ...
```

#### habran

• Member
• Posts: 1225
##### Re: reason to switch to 64 Bit Assembler
« Reply #25 on: February 11, 2013, 03:17:47 PM »
qWord,
I have used counter_begin and counter_end as MACROS like this
Code:
```
local buff[256]:BYTE
    counter_begin 1,1
    invoke xmemcpy,ADDR buff,CTEXT("habran is very smart cooker"), 27
    counter_end
```
and I got the above-mentioned results
do you want to say that I lied :icon_eek:

however, I don't believe your testing because, looking at the C source, everyone can see that there is much more
work for the processor, and also more memory access, in C than in ASM

are you sure that your testing is correct?
if so, I will go back to the C64


#### habran

• Member
• Posts: 1225
##### Re: reason to switch to 64 Bit Assembler
« Reply #26 on: February 11, 2013, 03:25:23 PM »
JJ2007,
Quote
What do you mean with that?

207 reminded me of 2007, and it is funny because C knows that you don't like it

BTW 2007 reminded me of two James Bonds or a double agent 007
what are you actually doing in Italy? 8)

#### dedndave

• Member
• Posts: 8827
• Still using Abacus 2.0
##### Re: reason to switch to 64 Bit Assembler
« Reply #27 on: February 11, 2013, 04:05:21 PM »

#### habran

• Member
• Posts: 1225
##### Re: reason to switch to 64 Bit Assembler
« Reply #28 on: February 11, 2013, 04:30:28 PM »
thanks dedndave :P
Bye everyone

#### Gunther

• Member
• Posts: 3585
• Forgive your enemies, but never forget their names
##### Re: reason to switch to 64 Bit Assembler
« Reply #29 on: February 11, 2013, 06:25:49 PM »
Frank,

I think a new test can only confirm that 64-bit mov operations are
faster than 32-bit ones. If anyone has a new processor, say habran, and
the skill to write AVX code, he could do it.
It is not that difficult if he really wants to do the test; I can post the 64-bit MASM
code that I used 2 years ago. It has no AVX because neither Alex's PC nor mine is
AVX-capable.

Gunther