reason to switch to 64 Bit Assembler

Started by habran, February 10, 2013, 08:03:46 PM



qWord

habran,
your comparison between C and ASM in your first post is unfair, because it is obviously a debug build.
Also, a smarter "algorithm" would probably produce much better results, e.g. something like this (not tested):
void* xmemcpy(void *dest, void *src, unsigned int cb)
{
    /* copy in pointer-sized chunks: 8 bytes on x64, 4 on x32 */
    unsigned int cnt1 = cb >> ((sizeof(char*) == 8) ? 3 : 2);
    unsigned int cnt2 = cb & (sizeof(char*) - 1);   /* remaining tail bytes */
    char** p1 = (char**)dest;
    char** p2 = (char**)src;
    char* p3;
    char* p4;

    for (; cnt1--; p1++, p2++)
        *p1 = *p2;                  /* one pointer-sized chunk per iteration */

    p3 = (char*)p1;
    p4 = (char*)p2;

    if (sizeof(char*) == 8)         /* dead code for x32 */
        if (cnt2 & 4)
        {
            *((int*)p3) = *((int*)p4);      /* copy a remaining DWORD */
            p3 += 4; p4 += 4; cnt2 -= 4;
        }

    for (; cnt2--; p3++, p4++)
        *p3 = *p4;                  /* copy the last bytes one by one */

    return dest;
}
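
A quick sanity check might look like this (a hypothetical harness, not from the original post; it assumes the xmemcpy above is in scope):

#include <stdio.h>
#include <string.h>

int main(void)
{
    char src[31], dst[31];
    int i;
    for (i = 0; i < 31; i++)
        src[i] = (char)('A' + i % 26);
    xmemcpy(dst, src, 31);   /* on x64: 3 pointer-sized chunks, then 1 DWORD, then 3 bytes */
    puts(memcmp(dst, src, 31) == 0 ? "OK" : "FAIL");
    return 0;
}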
MREAL macros - when you need floating point arithmetic while assembling!

jj2007

Quote from: qWord on February 11, 2013, 06:23:36 AM
habran,
your comparison between C and ASM in your first post is unfair, because it is obviously a debug build.
Also, a smarter "algorithm" would probably produce much better results.

I wonder how efficient this code from the "64 beauty" example is (can't test it, unfortunately):

000000014004DA90  dec         r8   
000000014004DA93  and         r8,r8 <<< no need for that, the flag is already set
000000014004DA96  je          xmemcpy+22h (14004DA9Ah) <<< why not jne xmemcpy+0Eh?? static branch prediction rules would suggest that it is even faster...
000000014004DA98  jmp         xmemcpy+0Eh (14004DA86h) <<< can be dropped entirely.
000000014004DA9A  mov         rax,r9


Again, timings would be nice ;-)

frktons

From what I recall, a memcopy done with native 64-bit registers, on 64-bit
systems, was the fastest solution we found when, a couple of years ago, we
tested XMM/SSE2 code for this kind of operation.

The test was done on a 32 MB buffer that was simply blanked, so not really a memcopy,
but it was set up just to measure the performance of REP STOSQ vs MOVNTDQ,
measured via RDTSC.
The results were like:
Quote
Clearing done
117,940,861 clocks for a 33,554,432 bytes buffer with using REP STOSQ

Clearing done
1,208,750,068 clocks for a 33,554,432 bytes buffer with using MOVNTDQ

Code from Alex.
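
Alex's code itself is not shown here; a rough C sketch of the same measurement, assuming MSVC on x64 (__stosq emits REP STOSQ, _mm_stream_si128 emits MOVNTDQ), might look like this:

#include <stdio.h>
#include <malloc.h>
#include <intrin.h>
#include <emmintrin.h>

#define BUFSIZE (32 * 1024 * 1024)   /* 33,554,432 bytes, as in the test above */

int main(void)
{
    /* MOVNTDQ needs 16-byte aligned stores */
    unsigned char *buf = (unsigned char*)_aligned_malloc(BUFSIZE, 16);
    __m128i zero = _mm_setzero_si128();
    unsigned __int64 t;
    size_t i;

    t = __rdtsc();
    __stosq((unsigned __int64*)buf, 0, BUFSIZE / 8);   /* REP STOSQ */
    t = __rdtsc() - t;
    printf("%llu clocks with REP STOSQ\n", t);

    t = __rdtsc();
    for (i = 0; i < BUFSIZE; i += 16)
        _mm_stream_si128((__m128i*)(buf + i), zero);   /* MOVNTDQ */
    _mm_sfence();   /* drain the write-combining buffers before stopping the clock */
    t = __rdtsc() - t;
    printf("%llu clocks with MOVNTDQ\n", t);

    _aligned_free(buf);
    return 0;
}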

I agree with habran, as I said at the time, for many reasons, but I also
understand why years of work are not easily dropped or rewritten.  :t


Frank
There are only two days a year when you can't do anything: one is called yesterday, the other is called tomorrow, so today is the right day to love, believe, do and, above all, live.

Dalai Lama

Gunther

Frank,

Quote from: frktons on February 11, 2013, 10:35:01 AM
From what I recall, a memcopy done with native 64-bit registers, on 64-bit
systems, was the fastest solution we found when, a couple of years ago, we
tested XMM/SSE2 code for this kind of operation.

the situation has changed dramatically since the advent of Intel's AVX. We should do the test again.
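
For illustration, an AVX copy loop in C might look like this (a sketch only, assuming 32-byte aligned buffers, a size that is a multiple of 32, and a compiler switch like /arch:AVX; _mm256_load_si256 and _mm256_store_si256 map to VMOVDQA):

#include <immintrin.h>
#include <stddef.h>

/* copies cb bytes in 32-byte YMM chunks; cb must be a multiple of 32 */
void avx_copy(void *dest, const void *src, size_t cb)
{
    __m256i *d = (__m256i*)dest;
    const __m256i *s = (const __m256i*)src;
    size_t i;
    for (i = 0; i < cb / 32; i++)
        _mm256_store_si256(d + i, _mm256_load_si256(s + i));   /* VMOVDQA ymm, ymm */
}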

Gunther
You have to know the facts before you can distort them.

frktons

Quote from: Gunther on February 11, 2013, 10:51:18 AM

the situation has changed dramatically since the advent of Intel's AVX. We should do the test again.

Gunther

I think a new test can only confirm that 64-bit mov operations are
faster than 32-bit ones. If anyone has a new processor, say habran, and
the skill to use AVX code, he could do it.
It's not that difficult if he really wants to do the test; I can post the 64-bit MASM
code that I used 2 years ago. No AVX, because neither Alex's PC nor mine is
AVX-capable.
There are only two days a year when you can't do anything: one is called yesterday, the other is called tomorrow, so today is the right day to love, believe, do and, above all, live.

Dalai Lama

habran

I have tested the speed of 64-bit code and the result is 1:4 against C:
for one pass JWASM is 80, or 50h,
and C is 207, or 0CFh. Interesting :P (JJ2007)
JJ2007, you are correct that more optimization could be done to it,
however my intent in this case was not so much focused on that but on the beauty and simplicity of 64-bit JWASM.
I have used the ".for" loop, which is portable, readable and easy to use, but it can not beat human eyes and brains.

thank you Frank for supporting me, that's what friends are for :t

qWord,
Quote
your comparison between C and ASM in your first post is unfair, because it is obviously a debug build
all of them are debug builds because I needed to read the code in memory :icon_eek:
your function looks good and I will test it later.
However, we can write a more complex function, e.g. xxxmemcpy, which would calculate the size of the data
and then first transfer all possible QWORDs, then, if anything is left, the last DWORD, then the last WORD, and then the last BYTE.
E.g.: data size is 256+7 EQU 32 QWORDs, 1 DWORD, 1 WORD, and 1 BYTE (see the sketch below).
I think I have already seen a similar function on the internet, but I can't remember whether it was in C or assembler.
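
In C, that cascade might look like this (a sketch of the idea only, untested; the usual strict-aliasing caveats apply):

#include <stdint.h>
#include <stddef.h>

void* xxxmemcpy(void *dest, const void *src, size_t cb)
{
    uint8_t *d = (uint8_t*)dest;
    const uint8_t *s = (const uint8_t*)src;

    for (; cb >= 8; cb -= 8, d += 8, s += 8)        /* all possible QWORDs */
        *(uint64_t*)d = *(const uint64_t*)s;
    if (cb & 4) { *(uint32_t*)d = *(const uint32_t*)s; d += 4; s += 4; }   /* last DWORD */
    if (cb & 2) { *(uint16_t*)d = *(const uint16_t*)s; d += 2; s += 2; }   /* last WORD */
    if (cb & 1) { *d = *s; }                                               /* last BYTE */
    return dest;
}

For cb = 256+7 = 263 this gives exactly the 32 QWORDs, 1 DWORD, 1 WORD and 1 BYTE from the example above.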
UNFAIR :icon_eek:
what is fair in this world??? life is a bitch!
these days even death is not fair any more: if you are rich, you buy yourself brand new organs and live as long as you want :P
Cod-Father

habran

hi Frank,
I can try to do that, though I have not learned AVX yet.
I am ready for another challenge, I am not a chicken :lol:
I have to go now to earn my living, "I'll be back 8)"
Cod-Father

jj2007

What do you mean with that?

qWord

Quote from: habran on February 11, 2013, 11:35:59 AM
I have tested the speed of 64-bit code and the result is 1:4 against C
I can't confirm that: my own quick test shows that there is nearly no difference between your .for-loop and xmemcpy.
Function:   xmemcpy    xmemcpy2   xmemcpy_Q  xmemcpy_Q2 memcpy     @ForLoop
--- buffer size = 13 ---
align +0    29         17         4          2          8          33
align +1    29         17         5          3          10         32
align +2    29         17         7          3          13         32
align +3    29         17         5          2          11         33
align +4    29         17         4          2          11         32
align +5    29         17         5          3          11         32
align +6    29         17         4          3          11         32
align +7    29         17         5          2          11         32
align +8    29         17         4          2          11         32
align +9    29         17         5          3          11         32
align +10   29         18         4          3          11         32
align +11   29         17         5          2          11         32
align +12   29         17         4          2          11         32
align +13   29         17         5          3          11         32
align +14   29         17         4          4          11         32
align +15   29         17         5          3          11         32
--- buffer size = 33 ---
align +0    93         46         10         10         10         77
align +1    73         46         10         7          10         76
align +2    77         46         10         7          10         76
align +3    92         46         10         7          10         77
align +4    73         46         10         7          10         76
align +5    82         46         10         7          10         76
align +6    73         46         10         7          10         77
align +7    91         47         10         7          10         76
align +8    73         47         11         7          13         76
align +9    73         47         10         7          10         91
align +10   73         46         10         7          10         76
align +11   73         47         10         7          11         77
align +12   76         47         10         7          10         76
align +13   86         47         10         7          10         76
align +14   74         47         10         7          10         76
align +15   84         47         18         8          10         76
--- buffer size = 59 ---
align +0    124        97         21         14         17         134
align +1    152        98         22         15         17         135
align +2    129        98         23         15         17         134
align +3    129        102        22         15         17         135
align +4    129        98         22         14         17         137
align +5    129        98         21         16         18         139
align +6    128        98         21         16         17         133
align +7    129        98         21         15         17         135
align +8    129        98         21         14         17         134
align +9    128        98         21         14         17         135
align +10   128        98         21         15         17         135
align +11   129        98         20         15         17         134
align +12   129        98         21         14         17         134
align +13   133        98         21         15         17         135
align +14   135        99         21         15         16         135
align +15   127        98         21         15         17         134
--- buffer size = 590 ---
align +0    920        908        150        123        65         1041
align +1    915        886        149        124        62         1040
align +2    922        906        149        124        82         1048
align +3    925        887        150        124        64         1037
align +4    920        891        150        127        63         1103
align +5    918        892        149        124        64         1042
align +6    974        897        157        128        82         1087
align +7    938        888        149        124        64         1032
align +8    921        889        151        123        65         1070
align +9    941        887        154        124        85         1051
align +10   937        888        150        124        63         1056
align +11   953        887        150        123        64         1013
align +12   920        897        151        124        63         1039
align +13   925        900        150        123        63         1017
align +14   938        892        151        124        63         1090
align +15   927        889        156        125        64         1053

  ---   Functions ----
  xmemcpy    : habran , PellesC
  xmemcpy2   : habran , VC 2012
  xmemcpy_Q  : qWord  , PellesC
  xmemcpy_Q2 : qWord  , VC 2012
  memcpy     : MSVCRT
  @ForLoop   : habran

Only the alignment of Src varies; Dest is allocated by HeapAlloc().

Press any key to continue ...
MREAL macros - when you need floating point arithmetic while assembling!

habran

qWord,
I have used counter_begin and counter_end as MACROS, like this:

local buff[256]:BYTE

    counter_begin 1,1
    invoke xmemcpy,ADDR buff,CTEXT("habran is very smart cooker"), 27
    counter_end

and I've got the above-mentioned results.
do you want to say that I lied? :icon_eek:

however, I don't believe your testing, because looking at the C source everyone can see that there is much more
work for the processor, and also more memory accessing, in C than in ASM.

are you sure that your testing is correct?
if so, I will go back to the C64 :bgrin:
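
For readers without qWord's macros, a very rough C analogue of that one-pass measurement, assuming MSVC's __rdtsc (the counter_* macros do considerably more, e.g. raise priority and subtract loop overhead), would be:

#include <stdio.h>
#include <string.h>
#include <intrin.h>

int main(void)
{
    char buff[256];
    unsigned __int64 t = __rdtsc();
    memcpy(buff, "habran is very smart cooker", 27);   /* one pass, as with counter_begin 1,1 */
    t = __rdtsc() - t;
    printf("%llu clocks for one pass\n", t);
    return 0;
}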

Cod-Father

habran

JJ2007,
Quote
What do you mean with that?

207 reminded me of 2007, and it is funny because C knows that you don't like it :biggrin:

BTW 2007 reminded me of two James Bonds, or double agent 007.
what are you actually doing in Italy? 8)
Cod-Father



Gunther

Frank,

Quote from: frktons on February 11, 2013, 11:05:20 AM
I think a new test can only confirm that 64-bit mov operations are
faster than 32-bit ones. If anyone has a new processor, say habran, and
the skill to use AVX code, he could do it.
It's not that difficult if he really wants to do the test; I can post the 64-bit MASM
code that I used 2 years ago. No AVX, because neither Alex's PC nor mine is
AVX-capable.

I can do that next weekend; please post your code.

Gunther
You have to know the facts before you can distort them.