C/C++ vs Assembler

jj2007 · May 06, 2013, 06:05:03 PM

Quote from: Antariy on May 06, 2013, 01:23:07 PM
OK, now the test with maximum performance optimization for C code.

The disassembly:

_Axhex2dw_C@4:
  00000000: 56            push  esi
  00000001: 8B 74 24 08   mov   esi,dword ptr [esp+8]
  00000005: 8A 0E         mov   cl,byte ptr [esi]
  00000007: 33 C0         xor   eax,eax
  00000009: 84 C9         test  cl,cl
  0000000B: 74 20         je    0000002D
  0000000D: 8D 49 00      lea   ecx,[ecx]
  00000010: 0F BE C9      movsx ecx,cl
  00000013: 8B D1         mov   edx,ecx
  00000015: C1 FA 06      sar   edx,6
  00000018: 83 E1 0F      and   ecx,0Fh
  0000001B: 8D 14 D2      lea   edx,[edx+edx*8]
  0000001E: 03 D1         add   edx,ecx
  00000020: 8A 4E 01      mov   cl,byte ptr [esi+1]
  00000023: 46            inc   esi
  00000024: C1 E0 04      shl   eax,4
  00000027: 03 C2         add   eax,edx
  00000029: 84 C9         test  cl,cl
  0000002B: 75 E3         jne   00000010
  0000002D: 5E            pop   esi
  0000002E: C2 04 00      ret   4

49 bytes long.

Hi Alex,

I invested "considerable knowledge, skill, and half an hour of my precious time, to get a near-optimal solution in assembly language".

AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
39 cycles for Small 1
42 cycles for Small 2
39 cycles for Small 3
39 cycles for Small 3.1
39 cycles for Small 4
46 cycles for C version
38 cycles for C mod JJ

I hope 17% improvement on the optimising C compiler qualifies as a "near-optimal solution" ;-)

Axhex2dw_CJ proc src   ; old C original modified by JJ
   push esi
   mov esi, dword ptr [esp+8]
   movsx ecx, byte ptr [esi]
   xor eax, eax
   jmp @go
;    test ecx, ecx
;    je bye
@@:   ; movsx ecx, cl   ; now superfluous
   mov edx, ecx
   sar edx, 6
   and ecx, 0Fh
   lea edx, [edx+edx*8]
   add edx, ecx
   movsx ecx, byte ptr [esi+1]
   shl eax, 4
   inc esi
   add eax, edx
@go:   test ecx, ecx
   jne @B
bye:   pop esi
   ret 4
Axhex2dw_CJ endp
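
For readers who do not read assembly fluently, the inner loop above can be written out in C roughly as follows. This is only a sketch of what the instructions compute, not the C source that was actually fed to the compiler (that source is not shown in this post); the name hex2dw_sketch is made up for illustration. The trick is that `sar edx, 6` yields 0 for the digits '0'..'9' and 1 for the letters 'A'..'F' and 'a'..'f', `lea edx, [edx+edx*8]` multiplies that by 9, and `and ecx, 0Fh` keeps the low nibble of the character.

```c
/* Hypothetical C rendering of the Axhex2dw_CJ loop; the function name is
   not from the thread. Like the assembly, it does not validate its input. */
static unsigned int hex2dw_sketch(const char *s)
{
    unsigned int result = 0;

    for (; *s != '\0'; ++s) {
        int c = *s;                          /* movsx ecx, byte ptr [esi]        */
        int letter = c >> 6;                 /* sar edx, 6: 0 = digit, 1 = letter */
        unsigned int digit =
            (unsigned int)(letter * 9        /* lea edx, [edx+edx*8]             */
                           + (c & 0x0F));    /* and ecx, 0Fh / add edx, ecx      */

        result = (result << 4) + digit;      /* shl eax, 4 / add eax, edx        */
    }
    return result;
}
```

With this sketch, hex2dw_sketch("ABCDEF01") and hex2dw_sketch("abcdef01") both give 0ABCDEF01h, which matches the "ABCDEF01 returned" line in the test output and shows why the algo is case insensitive without any explicit case handling.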

Manos · May 06, 2013, 06:49:02 PM

Quote from: hutch-- on May 06, 2013, 01:43:35 PM
I did not write the software and nor do I care who likes it or not, it does the job and it is not going to be modified to suit a quirk of that few folks who have not done the work to build a new forum and archive the old one.

The advice has already been given by more active members, make some more posts.

I know how forum software works, and I know how to program forum software with VBasic and Java scripts.

When I wrote:
P.S.
Below my name on the left of your forum writes:
Manos
New Member.

But I am one of the first members since 2004.
It would be better to write:
Old Member !!!,

I was only making a joke.
But some people did not understand the spirit of my words.
If I were vain enough to want stars next to my name,
I could post good morning, good afternoon and good night in this forum every day.

Manos.

P.S.
Your previous forum in the U.K. was much faster.

anta40 · May 06, 2013, 07:24:03 PM

Quote from: hutch-- on May 06, 2013, 05:25:12 PM
I have an example in mind, a hybrid sort of Robert Sedgewick originally written in C and it was genuinely fast.

Is the code listing available somewhere on the internet?
Or is it available in his book, i.e. Algorithms in C?

Antariy · May 06, 2013, 07:28:52 PM

Hi Jochen :t

In that post I also noted that at first look the compiler does a good job, but some small bits seem to come from ancient assumptions in its optimization techniques. Probably, as Hutch said, this is just a case where the algo is already at the edge of its performance. Well, it was developed in front of your eyes, you remember.

So, this is a very good example of Human vs compiler: the same algo => the same inner loop logic in HLL and ASM => the same inner loop code in the compiler-generated code => and STILL the Human can improve the compiler's work for a particular task and/or hardware. And you have just demonstrated it :t

On my CPU the timings hardly change (incredible!). I ran it multiple times; these are the smallest timings:



Intel(R) Celeron(R) CPU 2.13GHz (SSE3)

43      cycles for Small 1
44      cycles for Small 2
43      cycles for Small 3
46      cycles for Small 3.1
43      cycles for Small 4
79      cycles for C version
79      cycles for C mod JJ

43      cycles for Small 1
44      cycles for Small 2
37      cycles for Small 3
47      cycles for Small 3.1
46      cycles for Small 4
79      cycles for C version
72      cycles for C mod JJ

43      cycles for Small 1
44      cycles for Small 2
43      cycles for Small 3
46      cycles for Small 3.1
46      cycles for Small 4
79      cycles for C version
79      cycles for C mod JJ


48       bytes for Axhex2dw_C2
43       bytes for Axhex2dw_CJ
ABCDEF01        returned
--- ok ---

But your CPU obviously does not like the longer, superfluous code generated by the compiler; it still prefers the Human's code.

It would be interesting to see how more modern CPUs run it.

habran · May 06, 2013, 08:06:40 PM

Intel(R) Core(TM) i7-3610QM CPU @ 2.30GHz (SSE4)

19 cycles for Small 1
10 cycles for Small 2
39 cycles for Small 3
38 cycles for Small 3.1
19 cycles for Small 4
26 cycles for C version
19 cycles for C mod JJ

17 cycles for Small 1
17 cycles for Small 2
18 cycles for Small 3
18 cycles for Small 3.1
18 cycles for Small 4
26 cycles for C version
20 cycles for C mod JJ

16 cycles for Small 1
19 cycles for Small 2
19 cycles for Small 3
18 cycles for Small 3.1
18 cycles for Small 4
25 cycles for C version
19 cycles for C mod JJ

48 bytes for Axhex2dw_C2
43 bytes for Axhex2dw_CJ
ABCDEF01 returned
--- ok ---

qWord · May 06, 2013, 09:33:07 PM

Behold and see...


Intel(R) Core(TM) i7-3610QM CPU @ 2.30GHz (SSE4)

17      cycles for Small 1
18      cycles for Small 2
19      cycles for Small 3
18      cycles for Small 3.1
19      cycles for Small 4
25      cycles for C version
19      cycles for C mod JJ
18      hsz2dw (Microsoft 32-Bit C/C++ optimization compiler v16.00.40219.01)
11      hsz2dw2 (unrolled 4 times)

17      cycles for Small 1
23      cycles for Small 2
16      cycles for Small 3
17      cycles for Small 3.1
23      cycles for Small 4
24      cycles for C version
19      cycles for C mod JJ
16      hsz2dw (Microsoft 32-Bit C/C++ optimization compiler v16.00.40219.01)
13      hsz2dw2 (unrolled 4 times)

20      cycles for Small 1
15      cycles for Small 2
19      cycles for Small 3
19      cycles for Small 3.1
19      cycles for Small 4
26      cycles for C version
19      cycles for C mod JJ
16      hsz2dw (Microsoft 32-Bit C/C++ optimization compiler v16.00.40219.01)
13      hsz2dw2 (unrolled 4 times)


48       bytes for Axhex2dw_C2
43       bytes for Axhex2dw_CJ
ABCDEF01        returned
--- ok ---


unsigned int hsz2dw2(char* psz){
	const static unsigned char lut[256] = { 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
										   ,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,2,3,4,5,6,7,8,9,0,0,0,0,0,0,0
										   ,10,11,12,13,14,15,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
										   ,10,11,12,13,14,15,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
										   ,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
										   ,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
										   ,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
										   ,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 };
	
	register unsigned int c;
	register unsigned int r=0;
	register unsigned char* p = (unsigned char*) psz;
	while(1) {
		if(!(c=*p))
			break;
		r<<=4;
		r+=lut[c];
		p++;

		if(!(c=*p))
			break;
		r<<=4;
		r+=lut[c];
		p++;

		if(!(c=*p))
			break;
		r<<=4;
		r+=lut[c];
		p++;

		if(!(c=*p))
			break;
		r<<=4;
		r+=lut[c];
		p++;
	}
	return r;
}

jj2007 · May 06, 2013, 11:05:26 PM

Refuses to optimise for AMD ;-)

AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
39 cycles for Small 1
42 cycles for Small 2
39 cycles for Small 3
39 cycles for Small 3.1
39 cycles for Small 4
46 cycles for C version
38 cycles for C mod JJ
47 hsz2dw (Microsoft 32-Bit C/C++ optimization compiler v16.00.40219.01)
43 hsz2dw2 (unrolled 4 times)

hutch-- · May 06, 2013, 11:07:31 PM

> I know how forum software works, and I know how to program forum software with VBasic and Java scripts.

You would be surprised just how badly it would run on this 64 bit Unix server. The forum is written in PHP, not VBscript and JAVA, and NO, the forum software will not be modified.

FORTRANS · May 06, 2013, 11:09:58 PM

Hi,

Three more data points for you.

Cheers,



 (SSE1)

61	cycles for Small 1
57	cycles for Small 2
59	cycles for Small 3
60	cycles for Small 3.1
60	cycles for Small 4
71	cycles for C version
65	cycles for C mod JJ
57	hsz2dw (Microsoft 32-Bit C/C++ optimization compiler v16.00.40219.01)
28	hsz2dw2 (unrolled 4 times)

60	cycles for Small 1
59	cycles for Small 2
59	cycles for Small 3
60	cycles for Small 3.1
62	cycles for Small 4
71	cycles for C version
63	cycles for C mod JJ
55	hsz2dw (Microsoft 32-Bit C/C++ optimization compiler v16.00.40219.01)
29	hsz2dw2 (unrolled 4 times)

60	cycles for Small 1
60	cycles for Small 2
59	cycles for Small 3
60	cycles for Small 3.1
60	cycles for Small 4
71	cycles for C version
63	cycles for C mod JJ
55	hsz2dw (Microsoft 32-Bit C/C++ optimization compiler v16.00.40219.01)
28	hsz2dw2 (unrolled 4 times)


48	 bytes for Axhex2dw_C2
43	 bytes for Axhex2dw_CJ
ABCDEF01	returned
--- ok ---

120	cycles for Small 1
143	cycles for Small 2
125	cycles for Small 3
132	cycles for Small 3.1
131	cycles for Small 4
117	cycles for C version
107	cycles for C mod JJ
101	hsz2dw (Microsoft 32-Bit C/C++ optimization compiler v16.00.40219.01)
100	hsz2dw2 (unrolled 4 times)

119	cycles for Small 1
141	cycles for Small 2
136	cycles for Small 3
125	cycles for Small 3.1
123	cycles for Small 4
117	cycles for C version
106	cycles for C mod JJ
102	hsz2dw (Microsoft 32-Bit C/C++ optimization compiler v16.00.40219.01)
100	hsz2dw2 (unrolled 4 times)

117	cycles for Small 1
139	cycles for Small 2
130	cycles for Small 3
125	cycles for Small 3.1
127	cycles for Small 4
117	cycles for C version
106	cycles for C mod JJ
102	hsz2dw (Microsoft 32-Bit C/C++ optimization compiler v16.00.40219.01)
100	hsz2dw2 (unrolled 4 times)


48	 bytes for Axhex2dw_C2
43	 bytes for Axhex2dw_CJ
ABCDEF01	returned
--- ok ---
Intel(R) Pentium(R) M processor 1.70GHz (SSE2)

41	cycles for Small 1
38	cycles for Small 2
45	cycles for Small 3
41	cycles for Small 3.1
41	cycles for Small 4
51	cycles for C version
56	cycles for C mod JJ
31	hsz2dw (Microsoft 32-Bit C/C++ optimization compiler v16.00.40219.01)
24	hsz2dw2 (unrolled 4 times)

41	cycles for Small 1
38	cycles for Small 2
41	cycles for Small 3
41	cycles for Small 3.1
41	cycles for Small 4
51	cycles for C version
50	cycles for C mod JJ
31	hsz2dw (Microsoft 32-Bit C/C++ optimization compiler v16.00.40219.01)
24	hsz2dw2 (unrolled 4 times)

41	cycles for Small 1
38	cycles for Small 2
41	cycles for Small 3
41	cycles for Small 3.1
41	cycles for Small 4
51	cycles for C version
50	cycles for C mod JJ
31	hsz2dw (Microsoft 32-Bit C/C++ optimization compiler v16.00.40219.01)
24	hsz2dw2 (unrolled 4 times)


48	 bytes for Axhex2dw_C2
43	 bytes for Axhex2dw_CJ
ABCDEF01	returned
--- ok ---

Antariy · May 06, 2013, 11:19:20 PM

Hi qWord :t

Yeah, unrolling is the way to make it faster, but the tested algo is not unrolled, and that's the point. It is "classic", more or less small, "looped" code. These characteristics are intentional: this was not a contest but rather a test.
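
To make the "looped" versus "unrolled" distinction concrete, here is a minimal sketch of a table-driven conversion that spends exactly one loop iteration per digit, i.e. the same lookup idea as hsz2dw2 but without the four copies of the loop body. It is an illustration only: the source of hsz2dw is not shown in this thread, and the names hex_lut, fill_hex_lut and hex2dw_looped are hypothetical. The table is filled by a small helper instead of being typed out by hand.

```c
/* Hypothetical 256-entry table; static storage starts out as all zeros,
   so every non-hex character maps to 0 (no input checking). */
static unsigned char hex_lut[256];

/* Must be called once before hex2dw_looped() is used. */
static void fill_hex_lut(void)
{
    int i;

    for (i = 0; i < 10; ++i)
        hex_lut['0' + i] = (unsigned char)i;
    for (i = 0; i < 6; ++i) {
        hex_lut['A' + i] = (unsigned char)(10 + i);
        hex_lut['a' + i] = (unsigned char)(10 + i);
    }
}

/* One iteration per digit: shift the result by a nibble, add the lookup. */
static unsigned int hex2dw_looped(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    unsigned int r = 0;

    while (*p != 0) {
        r = (r << 4) + hex_lut[*p];
        ++p;
    }
    return r;
}
```

Unrolling repeats the loop body several times per pass, so the backward jump is taken less often at the price of larger code.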



Intel(R) Celeron(R) CPU 2.13GHz (SSE3)

43      cycles for Small 1
45      cycles for Small 2
43      cycles for Small 3
47      cycles for Small 3.1
43      cycles for Small 4
79      cycles for C version
79      cycles for C mod JJ
40      hsz2dw (Microsoft 32-Bit C/C++ optimization compiler v16.00.40219.01)
31      hsz2dw2 (unrolled 4 times)

43      cycles for Small 1
45      cycles for Small 2
66      cycles for Small 3
47      cycles for Small 3.1
43      cycles for Small 4
79      cycles for C version
79      cycles for C mod JJ
37      hsz2dw (Microsoft 32-Bit C/C++ optimization compiler v16.00.40219.01)
31      hsz2dw2 (unrolled 4 times)

43      cycles for Small 1
45      cycles for Small 2
43      cycles for Small 3
47      cycles for Small 3.1
43      cycles for Small 4
79      cycles for C version
79      cycles for C mod JJ
37      hsz2dw (Microsoft 32-Bit C/C++ optimization compiler v16.00.40219.01)
31      hsz2dw2 (unrolled 4 times)


48       bytes for Axhex2dw_C2
43       bytes for Axhex2dw_CJ
ABCDEF01        returned
--- ok ---

Actually, the testbed I posted was a trimmed version of one I made some time ago. I will post it now: it contains the procs that were in the contest here earlier. (I used Axhex2dw just because it is the fastest so far of the tested hex2dw procs with these characteristics: case insensitive, does not check input, looped (i.e. one loop iteration for every digit, not unrolled at all). And it looks like it is copyrighted by me; at least no one has disputed the rights for ~3 years :lol:)

OK, here are the timings for the attached archive (it is the old testbed):



Intel(R) Celeron(R) CPU 2.13GHz (SSE3)



25      cycles for Fast version
27      cycles for Fast version under AMD
43      cycles for Small 1
45      cycles for Small 2
43      cycles for Small 3
47      cycles for Small 3.1
43      cycles for Small 4
28      cycles for MMX 1
28      cycles for MMX 2
32      cycles for SSE1

Other's Versions:
48      cycles for Axhex2dw improved by Hutch (1)
83      cycles for Axhex2dw improved by Hutch (2)

28      cycles for Lingo's SSE version
24      cycles for Lingo's BIG integer version
23      cycles for Jochen's WORD-Indexed version
27      cycles for Dave's version (with minor changes)


25      cycles for Fast version
27      cycles for Fast version under AMD
43      cycles for Small 1
45      cycles for Small 2
43      cycles for Small 3
59      cycles for Small 3.1
43      cycles for Small 4
28      cycles for MMX 1
28      cycles for MMX 2
30      cycles for SSE1

Other's Versions:
48      cycles for Axhex2dw improved by Hutch (1)
83      cycles for Axhex2dw improved by Hutch (2)

28      cycles for Lingo's SSE version
24      cycles for Lingo's BIG integer version
23      cycles for Jochen's WORD-Indexed version
27      cycles for Dave's version (with minor changes)


25      cycles for Fast version
30      cycles for Fast version under AMD
43      cycles for Small 1
116     cycles for Small 2
43      cycles for Small 3
47      cycles for Small 3.1
43      cycles for Small 4
28      cycles for MMX 1
28      cycles for MMX 2
46      cycles for SSE1

Other's Versions:
48      cycles for Axhex2dw improved by Hutch (1)
83      cycles for Axhex2dw improved by Hutch (2)

28      cycles for Lingo's SSE version
24      cycles for Lingo's BIG integer version
23      cycles for Jochen's WORD-Indexed version
27      cycles for Dave's version (with minor changes)

==========
Codesizes:
Axhex2dw_Unrolled:      396
Axhex2dw_Unrolled_AMD:  396
Axhex2dw1 - 1:  69
Axhex2dw2 - 2:  48
Axhex2dw3 - 3:  57
Axhex2dw3_1 - 3.1:      56
Axhex2dw3 - 4:  61
Axhex2dw_MMX:   128
Axhex2dw_MMX2:  160
Axhex2dw_SSE:   160
Alex_Short_Hutch:       59
Axhex2dw_Hutch2:        54
Hex2dwLingoSSE: 160
lingo_htodw:    1950
ax_jj_htodw:    174
krbhtodw:       547
--- ok ---

krbhtodw: the author is Dave (KeepingRealBusy), with minor changes made with his permission. It is the most universal proc: it checks the input and can process "ignorant chars". It is lookup-table based.
The fastest GPR code is ax_jj_htodw by Jochen (jj2007); it uses a word-indexed lookup table.

Everything not under "Other's Versions" is mine, but when I posted in this thread I excluded every non-GPR, unrolled and/or lookup-table-based version. Well, new CPUs have been released since then, so maybe it is interesting to test all these procs again.

Antariy · May 06, 2013, 11:46:31 PM

BTW: Jochen's code is 174 bytes long AND its lookup table is initialized once at run time and takes no space in the EXE, so it is not only the fastest but also the smallest of the unrolled versions (the size includes the hex2dw code plus the table initialization code).
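
The idea of a word-indexed table that is built at run time can be sketched in C as below. To be clear, this is not Jochen's ax_jj_htodw (its source is not in this thread); it is a guess at the general technique, and pair_lut, pair_nibble, init_pair_lut and hex2dw_pairs are hypothetical names. Two characters form a 16-bit index, and one table byte holds both decoded nibbles, so each lookup handles two hex digits. Because the table is uninitialized static data, it costs address space at run time but no bytes in the executable image.

```c
/* 64 KB table indexed by a pair of characters. Static and zero-initialized,
   so it is not stored in the file, only reserved at load time. */
static unsigned char pair_lut[65536];
static int pair_lut_ready = 0;

/* Same arithmetic as the looped versions; masked so any byte gives a nibble. */
static unsigned int pair_nibble(unsigned int c)
{
    return ((c >> 6) * 9 + (c & 0x0F)) & 0x0F;
}

/* One-time initialization: entry [(hi << 8) | lo] holds two decoded digits. */
static void init_pair_lut(void)
{
    unsigned int hi, lo;

    for (hi = 0; hi < 256; ++hi)
        for (lo = 0; lo < 256; ++lo)
            pair_lut[(hi << 8) | lo] =
                (unsigned char)((pair_nibble(hi) << 4) | pair_nibble(lo));
    pair_lut_ready = 1;
}

static unsigned int hex2dw_pairs(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    unsigned int r = 0;

    if (!pair_lut_ready)
        init_pair_lut();

    while (p[0] != 0) {
        if (p[1] == 0) {                 /* odd length: one digit left over */
            r = (r << 4) | pair_nibble(p[0]);
            break;
        }
        r = (r << 8) | pair_lut[((unsigned int)p[0] << 8) | p[1]];
        p += 2;                          /* two digits per lookup */
    }
    return r;
}
```

The trade-off is the usual one for big tables: fewer instructions per digit, but a 64 KB table competes for cache with everything else, so timings in a tight benchmark loop may flatter it.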

qWord · May 07, 2013, 12:53:42 AM

OK - I obviously missed the "spirit" of this thread...

Gunther · May 07, 2013, 12:54:05 AM

Hi Alex,

here are the timings for your 32Alex's_hex2dw.exe:

Quote
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)

26   cycles for Fast version
29   cycles for Fast version under AMD
50   cycles for Small 1
49   cycles for Small 2
51   cycles for Small 3
54   cycles for Small 3.1
51   cycles for Small 4
14   cycles for MMX 1
15   cycles for MMX 2
15   cycles for SSE1

Other's Versions:
54   cycles for Axhex2dw improved by Hutch (1)
62   cycles for Axhex2dw improved by Hutch (2)

10   cycles for Lingo's SSE version
19   cycles for Lingo's BIG integer version
17   cycles for Jochen's WORD-Indexed version
35   cycles for Dave's version (with minor changes)

29   cycles for Fast version
32   cycles for Fast version under AMD
48   cycles for Small 1
52   cycles for Small 2
54   cycles for Small 3
54   cycles for Small 3.1
50   cycles for Small 4
11   cycles for MMX 1
15   cycles for MMX 2
15   cycles for SSE1

Other's Versions:
65   cycles for Axhex2dw improved by Hutch (1)
60   cycles for Axhex2dw improved by Hutch (2)

10   cycles for Lingo's SSE version
19   cycles for Lingo's BIG integer version
17   cycles for Jochen's WORD-Indexed version
34   cycles for Dave's version (with minor changes)

29   cycles for Fast version
31   cycles for Fast version under AMD
48   cycles for Small 1
53   cycles for Small 2
54   cycles for Small 3
54   cycles for Small 3.1
54   cycles for Small 4
14   cycles for MMX 1
15   cycles for MMX 2
15   cycles for SSE1

Other's Versions:
55   cycles for Axhex2dw improved by Hutch (1)
62   cycles for Axhex2dw improved by Hutch (2)

10   cycles for Lingo's SSE version
19   cycles for Lingo's BIG integer version
17   cycles for Jochen's WORD-Indexed version
34   cycles for Dave's version (with minor changes)

==========
Codesizes:
Axhex2dw_Unrolled:   396
Axhex2dw_Unrolled_AMD:   396
Axhex2dw1 - 1:   69
Axhex2dw2 - 2:   48
Axhex2dw3 - 3:   57
Axhex2dw3_1 - 3.1:   56
Axhex2dw3 - 4:   61
Axhex2dw_MMX:   128
Axhex2dw_MMX2:   160
Axhex2dw_SSE:   160
Alex_Short_Hutch:   59
Axhex2dw_Hutch2:   54
Hex2dwLingoSSE:   160
lingo_htodw:   1950
ax_jj_htodw:   174
krbhtodw:   547
--- ok ---

dedndave · May 07, 2013, 01:16:18 AM

Quote from: qWord on May 07, 2013, 12:53:42 AM
OK - I obviously missed the "spirit" of this thread...

as if the subject has never come up before

jj2007 · May 07, 2013, 02:07:53 AM

Quote from: qWord on May 07, 2013, 12:53:42 AM
OK - I obviously missed the "spirit" of this thread...

Hey, your code was actually quite good. Even if your 'piler refuses to optimise for my AMD ;)

The MASM Forum
