reason to switch to 64 Bit Assembler

habran · February 11, 2013, 05:49:30 AM

thanks qWord

qWord · February 11, 2013, 06:23:36 AM

habran,
your comparison between C and ASM in your first post is unfair, because it is obviously a debug build.
Also, a smarter "algorithm" would probably produce much better results, e.g. something like this (not tested):

Code:

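void* xmemcpy(void *dest, void *src, unsigned int cb)
{
	// number of pointer-sized words: cb/8 on 64-bit, cb/4 on 32-bit
	unsigned int cnt1 = cb>>((sizeof(char*)==8)?3:2);
	// bytes left over after the word-sized copies
	unsigned int cnt2 = cb&(sizeof(char*)-1);
	char** p1 = (char**)dest;
	char** p2 = (char**)src;
	char* p3;
	char* p4;

	// bulk copy, one pointer-sized word per iteration
	for(;cnt1--;p1++,p2++)
		*p1 = *p2;
	
	p3 = (char*)p1;
	p4 = (char*)p2;

	if (sizeof(char*) == 8)	// dead code on 32-bit targets
		if(cnt2&4)
		{	// copy one 4-byte chunk of the tail
			*((int*)p3)= *((int*)p4);
			p3+=4;p4+=4;cnt2-=4;
		}

	// copy the remaining 0..3 bytes one at a time
	for(;cnt2--;p3++,p4++)
		*p3 = *p4;

	return dest;
}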

jj2007 · February 11, 2013, 09:07:53 AM

Quote from: qWord on February 11, 2013, 06:23:36 AM
habran,
your comparison between C and ASM in your first post is unfair, because it is obviously a debug build.
Also, a smarter "algorithm" would probably produce much better results.

I wonder how efficient this code from the "64 beauty" example is (can't test it, unfortunately):

000000014004DA90 dec r8
000000014004DA93 and r8,r8 <<< no need for that, the flag is already set
000000014004DA96 je xmemcpy+22h (14004DA9Ah) <<< why not jne xmemcpy+0Eh?? static branch prediction rules would suggest that it is even faster...
000000014004DA98 jmp xmemcpy+0Eh (14004DA86h) <<< can be dropped entirely.
000000014004DA9A mov rax,r9

Again, timings would be nice ;-)
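
The three remarks above amount to ending the loop with a single decrement followed by one backward conditional jump. A minimal C sketch of that loop shape follows. The name copy_bytes_countdown is made up for illustration, and whether a given compiler really emits a dec/jne pair for it depends on the compiler and its optimization settings.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not code from the thread: a byte copy whose loop
   ends with one decrement and one backward conditional branch. */
void *copy_bytes_countdown(void *dest, const void *src, size_t n)
{
    unsigned char *d = (unsigned char *)dest;
    const unsigned char *s = (const unsigned char *)src;

    if (n == 0)                 /* test for "nothing to do" once, outside the loop */
        return dest;

    assert(dest != NULL && src != NULL);

    do {
        *d++ = *s++;
    } while (--n != 0);         /* the decrement itself decides the branch */

    return dest;
}
```

With the zero-length case handled up front, the loop body needs no separate test of the counter and no unconditional jump back to the top.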

frktons · February 11, 2013, 10:35:01 AM

From what I recall, a memcopy done with native 64-bit registers on 64-bit
systems was the fastest solution we found when we tested XMM/SSE2 code
for this kind of operation a couple of years ago.

The test was done on a 32 MB buffer that was simply blanked. It was not really
a memcopy; it was set up just to compare the performance of REP STOSQ and
MOVNTDQ, measured via rdtsc.
The results were like:

Quote
Clearing done
117,940,861 clocks for a 33,554,432 bytes buffer with using REP STOSQ

Clearing done
1,208,750,068 clocks for a 33,554,432 bytes buffer with using MOVNTDQ

Code from Alex.
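
For readers without the original MASM source, here is a rough C sketch of the same kind of measurement. It is not Alex's code: it clears the buffer with a plain loop of 64-bit stores instead of REP STOSQ or MOVNTDQ, and it times with the standard clock() function instead of rdtsc, so its numbers are not comparable to the ones quoted above. The names clear_qwords and time_clear are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical helper: zero a buffer with 64-bit stores.
   Assumes buf is suitably aligned for uint64_t (memory from malloc is)
   and that size is a multiple of 8. */
void clear_qwords(void *buf, size_t size)
{
    uint64_t *p = (uint64_t *)buf;
    size_t n = size / sizeof(uint64_t);

    assert(size % sizeof(uint64_t) == 0);
    while (n--)
        *p++ = 0;
}

/* Hypothetical helper: time one clearing pass over a caller-owned buffer.
   Returns the elapsed CPU time in clock() ticks (not CPU cycles). */
long time_clear(void *buf, size_t size)
{
    clock_t t0 = clock();
    clear_qwords(buf, size);
    clock_t t1 = clock();
    return (long)(t1 - t0);
}
```

To mirror the test described above, one would call time_clear on a 33,554,432-byte buffer obtained from malloc and repeat the call several times, since a single pass measured with clock() is very coarse.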

I agree with habran, as I said at the time, for many reasons, but I also
understand why years of work are not easily dropped or rewritten. :t

Frank

Gunther · February 11, 2013, 10:51:18 AM

Frank,

Quote from: frktons on February 11, 2013, 10:35:01 AM
From what I recall, a memcopy done with native 64-bit registers on 64-bit
systems was the fastest solution we found when we tested XMM/SSE2 code
for this kind of operation a couple of years ago.

the situation has changed dramatically since the advent of Intel's AVX. We should do the test again.

Gunther

frktons · February 11, 2013, 11:05:20 AM

Quote from: Gunther on February 11, 2013, 10:51:18 AM

the situation has changed dramatically since the advent of Intel's AVX. We should do the test again.

Gunther

I think a new test can only confirm that 64-bit mov operations are
faster than 32-bit ones. If anyone has a new processor, say habran, and
the skill to write AVX code, he could do it.
It is not that difficult if he really wants to do the test; I can post the 64-bit
MASM code that I used 2 years ago. It has no AVX because neither Alex's PC
nor mine is AVX-capable.

habran · February 11, 2013, 11:35:59 AM

I have tested the speed of the 64-bit version, and the result is 1:4 against C
for one pass JWASM is 80, or 50h
and C is 207, interesting :P (JJ2007), or 0CFh
JJ2007, you are correct that more optimization could be done to it
however, my intent in this case was focused not so much on that as on the beauty and simplicity of 64-bit JWASM
I have used the ".for" loop, which is portable, readable and easy to use, but it cannot beat human eyes and brains

thank you Frank for supporting me, that's what friends are for :t

qWord,

Quote
your comparison between C and ASM in your first post is unfair, because it is obviously a debug build

all of them are debug builds because I needed to read the code in memory :icon_eek:
your function looks good and I will test it later

Quote
however, we can write a more complex function, e.g. xxxmemcpy, which would calculate the size of the data
and then first transfer all possible QWORDs, then a last DWORD if one is left, then a last WORD if one is left, and then a last BYTE if one is left
e.g. a data size of 256+7 equals 32 QWORDs, 1 DWORD, 1 WORD and 1 BYTE

I think I have already seen a similar function on the internet, but I can't remember whether it was in C or assembler
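
The QWORD / DWORD / WORD / BYTE split described above can be sketched in C roughly as follows. This is only an illustration under stated assumptions: chunk_plan, plan_chunks and chunked_copy are made-up names, and the fixed-size memcpy calls stand in for single 8-, 4- and 2-byte moves (compilers commonly reduce them to one load and one store, but that is not guaranteed).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical type: how many chunks of each width a copy of `size` bytes needs. */
struct chunk_plan {
    size_t   qwords;   /* 8-byte chunks          */
    unsigned dwords;   /* 0 or 1 chunk of 4 bytes */
    unsigned words;    /* 0 or 1 chunk of 2 bytes */
    unsigned bytes;    /* 0 or 1 single byte      */
};

struct chunk_plan plan_chunks(size_t size)
{
    struct chunk_plan p;
    p.qwords = size >> 3;            /* size / 8                  */
    p.dwords = (size & 4) ? 1 : 0;   /* bit 2 of the remainder    */
    p.words  = (size & 2) ? 1 : 0;   /* bit 1 of the remainder    */
    p.bytes  = (size & 1) ? 1 : 0;   /* bit 0 of the remainder    */
    return p;
}

void *chunked_copy(void *dest, const void *src, size_t size)
{
    unsigned char *d = (unsigned char *)dest;
    const unsigned char *s = (const unsigned char *)src;
    struct chunk_plan p = plan_chunks(size);

    while (p.qwords--) { memcpy(d, s, 8); d += 8; s += 8; }
    if (p.dwords)      { memcpy(d, s, 4); d += 4; s += 4; }
    if (p.words)       { memcpy(d, s, 2); d += 2; s += 2; }
    if (p.bytes)       { *d = *s; }

    assert((size_t)(d - (unsigned char *)dest) + p.bytes == size);
    return dest;
}
```

For the 256+7 example, plan_chunks(263) gives 32 QWORDs, 1 DWORD, 1 WORD and 1 BYTE, because 263 >> 3 is 32 and the remainder 7 is 4 + 2 + 1.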
UNFAIR :icon_eek:
what is fair in this world??? life is a bitch!
these days even death is not fair any more, if you are rich you buy yourself brand new organs and live as long as you want :P

habran · February 11, 2013, 11:43:07 AM

hi Frank,
I can try to do that, though I have not learned AVX yet
I am ready for another challenge, I am not a chicken :lol:
I have to go now to earn my living, "I'll be back 8)"

jj2007 · February 11, 2013, 11:48:15 AM

Quote from: habran on February 11, 2013, 11:35:59 AM
interesting :P (JJ2007), or 0CFh

What do you mean by that? ::)

qWord · February 11, 2013, 12:53:56 PM

Quote from: habran on February 11, 2013, 11:35:59 AM
I have tested a speed of 64 bit and result is 1:4 against C

I can't confirm that: my own quick test shows that there is nearly no difference between your .for-loop and xmemcpy.

Code:

Function:   xmemcpy    xmemcpy2   xmemcpy_Q  xmemcpy_Q2 memcpy     @ForLoop
 --- buffer size = 13 ---
align +0    29         17         4          2          8          33
align +1    29         17         5          3          10         32
align +2    29         17         7          3          13         32
align +3    29         17         5          2          11         33
align +4    29         17         4          2          11         32
align +5    29         17         5          3          11         32
align +6    29         17         4          3          11         32
align +7    29         17         5          2          11         32
align +8    29         17         4          2          11         32
align +9    29         17         5          3          11         32
align +10   29         18         4          3          11         32
align +11   29         17         5          2          11         32
align +12   29         17         4          2          11         32
align +13   29         17         5          3          11         32
align +14   29         17         4          4          11         32
align +15   29         17         5          3          11         32
 --- buffer size = 33 ---
align +0    93         46         10         10         10         77
align +1    73         46         10         7          10         76
align +2    77         46         10         7          10         76
align +3    92         46         10         7          10         77
align +4    73         46         10         7          10         76
align +5    82         46         10         7          10         76
align +6    73         46         10         7          10         77
align +7    91         47         10         7          10         76
align +8    73         47         11         7          13         76
align +9    73         47         10         7          10         91
align +10   73         46         10         7          10         76
align +11   73         47         10         7          11         77
align +12   76         47         10         7          10         76
align +13   86         47         10         7          10         76
align +14   74         47         10         7          10         76
align +15   84         47         18         8          10         76
 --- buffer size = 59 ---
align +0    124        97         21         14         17         134
align +1    152        98         22         15         17         135
align +2    129        98         23         15         17         134
align +3    129        102        22         15         17         135
align +4    129        98         22         14         17         137
align +5    129        98         21         16         18         139
align +6    128        98         21         16         17         133
align +7    129        98         21         15         17         135
align +8    129        98         21         14         17         134
align +9    128        98         21         14         17         135
align +10   128        98         21         15         17         135
align +11   129        98         20         15         17         134
align +12   129        98         21         14         17         134
align +13   133        98         21         15         17         135
align +14   135        99         21         15         16         135
align +15   127        98         21         15         17         134
 --- buffer size = 590 ---
align +0    920        908        150        123        65         1041
align +1    915        886        149        124        62         1040
align +2    922        906        149        124        82         1048
align +3    925        887        150        124        64         1037
align +4    920        891        150        127        63         1103
align +5    918        892        149        124        64         1042
align +6    974        897        157        128        82         1087
align +7    938        888        149        124        64         1032
align +8    921        889        151        123        65         1070
align +9    941        887        154        124        85         1051
align +10   937        888        150        124        63         1056
align +11   953        887        150        123        64         1013
align +12   920        897        151        124        63         1039
align +13   925        900        150        123        63         1017
align +14   938        892        151        124        63         1090
align +15   927        889        156        125        64         1053

  ---   Functions ----
  xmemcpy    : habran , PellesC
  xmemcpy2   : habran , VC 2012
  xmemcpy_Q  : qWord  , PellesC
  xmemcpy_Q2 : qWord  , VC 2012
  memcpy     : MSVCRT
  @ForLoop   : habran

 only alignment of Src varies, Dest is allocated by HeapAlloc()

Press any key to continue ...
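
The layout of the table above (one row per source offset from +0 to +15, with the destination left alone) can be reproduced with a small harness like the following C sketch. It is not qWord's test program: it uses malloc where the note above mentions HeapAlloc(), the offsets are relative to whatever alignment malloc returns, and it only checks that each copy is correct. The names copy_fn and sweep_src_alignment are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Any copy routine with the usual memcpy-style signature. */
typedef void *(*copy_fn)(void *dest, const void *src, size_t n);

/* Hypothetical harness: run `fn` once for each source offset 0..15.
   Returns how many offsets produced a correct copy (16 = all passed),
   or -1 if memory could not be allocated. */
int sweep_src_alignment(copy_fn fn, size_t size)
{
    unsigned char *src_base = malloc(size + 16);
    unsigned char *dest = malloc(size ? size : 1);
    int ok = 0;
    size_t off, i;

    if (!src_base || !dest) {
        free(src_base);
        free(dest);
        return -1;
    }

    for (i = 0; i < size + 16; i++)            /* recognizable test pattern */
        src_base[i] = (unsigned char)(i * 31u + 7u);

    for (off = 0; off < 16; off++) {
        memset(dest, 0, size);
        fn(dest, src_base + off, size);        /* a benchmark would time this call */
        if (memcmp(dest, src_base + off, size) == 0)
            ok++;
    }

    free(src_base);
    free(dest);
    return ok;
}
```

Calling sweep_src_alignment(memcpy, 13) exercises the same 16 offsets as the first block of the table; wrapping the marked call in a cycle counter would turn the check into a timing run.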

habran · February 11, 2013, 03:17:47 PM

qWord,
I have used counter_begin and counter_end as MACROS like this

Code:


local buff[256]:BYTE

    counter_begin 1,1
    invoke xmemcpy,ADDR buff,CTEXT("habran is very smart cooker"), 27
    counter_end

and I got the above-mentioned results
do you mean to say that I lied :icon_eek:

however, I don't believe your test because, looking at the C source, everyone can see that there is much more
work for the processor, and also more memory access, in C than in ASM

are you sure that your testing is correct?
if so I will go back to C64

habran · February 11, 2013, 03:25:23 PM

JJ2007,

Quote
What do you mean by that?

207 reminded me of 2007, and it is funny because C knows that you don't like it

BTW 2007 reminded me of two James Bonds, or double agent 007
what are you actually doing in Italy? 8)

dedndave · February 11, 2013, 04:05:21 PM

http://csdb.dk/forums/?roomid=11

habran · February 11, 2013, 04:30:28 PM

thanks dedndave :P
Bye everyone

Gunther · February 11, 2013, 06:25:49 PM

Frank,

Quote from: frktons on February 11, 2013, 11:05:20 AM
I think a new test can only confirm that 64-bit mov operations are
faster than 32-bit ones. If anyone has a new processor, say habran, and
the skill to write AVX code, he could do it.
It is not that difficult if he really wants to do the test; I can post the 64-bit
MASM code that I used 2 years ago. It has no AVX because neither Alex's PC
nor mine is AVX-capable.

I can do that next weekend; please post your code.

Gunther

The MASM Forum
