C/C++ vs Assembler

Started by Manos, May 04, 2013, 04:11:50 AM

jj2007

Quote from: Antariy on May 06, 2013, 01:23:07 PM
OK, now the test with maximum performance optimization for C code.

The disassembly:


_Axhex2dw_C@4:
  00000000: 56                 push        esi
  00000001: 8B 74 24 08        mov         esi,dword ptr [esp+8]
  00000005: 8A 0E              mov         cl,byte ptr [esi]
  00000007: 33 C0              xor         eax,eax
  00000009: 84 C9              test        cl,cl
  0000000B: 74 20              je          0000002D
  0000000D: 8D 49 00           lea         ecx,[ecx]
  00000010: 0F BE C9           movsx       ecx,cl
  00000013: 8B D1              mov         edx,ecx
  00000015: C1 FA 06           sar         edx,6
  00000018: 83 E1 0F           and         ecx,0Fh
  0000001B: 8D 14 D2           lea         edx,[edx+edx*8]
  0000001E: 03 D1              add         edx,ecx
  00000020: 8A 4E 01           mov         cl,byte ptr [esi+1]
  00000023: 46                 inc         esi
  00000024: C1 E0 04           shl         eax,4
  00000027: 03 C2              add         eax,edx
  00000029: 84 C9              test        cl,cl
  0000002B: 75 E3              jne         00000010
  0000002D: 5E                 pop         esi
  0000002E: C2 04 00           ret         4


49 bytes long.
Hi Alex,

I invested "considerable knowledge, skill, and half an hour of my precious time, to get a near-optimal solution in assembly language".

AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
39      cycles for Small 1
42      cycles for Small 2
39      cycles for Small 3
39      cycles for Small 3.1
39      cycles for Small 4
46      cycles for C version
38      cycles for C mod JJ


I hope a 17% improvement over the optimising C compiler's output (46 -> 38 cycles) qualifies as a "near-optimal solution" ;-)

Axhex2dw_CJ proc src   ; old C original modified by JJ
   push esi
   mov esi, dword ptr [esp+8]
   movsx ecx, byte ptr [esi]
   xor eax, eax
   jmp @go
;    test ecx, ecx
;    je bye
@@:   ; movsx ecx, cl   ; now superfluous
   mov edx, ecx
   sar edx, 6
   and ecx, 0Fh
   lea edx, [edx+edx*8]
   add edx, ecx
   movsx ecx, byte ptr [esi+1]
   shl eax, 4
   inc esi
   add eax, edx
@go:   test ecx, ecx
   jne @B
bye:   pop esi
   ret 4
Axhex2dw_CJ endp
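
For readers who prefer to follow along in C, the inner loop of both listings above can be read as the sketch below. It is a reconstruction from the disassembly, not the original C source (which is not shown in this thread), and the name axhex2dw_sketch is hypothetical. Each character c contributes (c >> 6) * 9 + (c & 0Fh): digits '0'..'9' have bit 6 clear, letters have it set, so letters get 9 added to their low nibble.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C rendering of the looped routine: one iteration per digit,
   case-insensitive, no validation of the input characters. */
uint32_t axhex2dw_sketch(const char *s)
{
    uint32_t result = 0;

    assert(s != NULL);                  /* debug-only guard, not part of the thread's code */
    for (; *s != '\0'; s++) {
        int c = (signed char)*s;        /* movsx ecx, byte ptr [esi] */
        int digit = (c >> 6) * 9        /* sar edx, 6  /  lea edx, [edx+edx*8] */
                  + (c & 0x0F);         /* and ecx, 0Fh  /  add edx, ecx */
        result = (result << 4) + (uint32_t)digit;   /* shl eax, 4  /  add eax, edx */
    }
    return result;
}
```

Called with "ABCDEF01" this returns 0ABCDEF01h, the value the testbed prints. Because only bit 6 and the low nibble are inspected, "abcdef01" gives the same result, and any non-hex character silently produces a wrong value rather than an error.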

Manos

Quote from: hutch-- on May 06, 2013, 01:43:35 PM
I did not write the software and nor do I care who likes it or not, it does the job and it is not going to be modified to suit a quirk of that few folks who have not done the work to build a new forum and archive the old one.

The advice has already been given by more active members, make some more posts.

I know how forum software works, and I know how to program forum software with VBasic and Java scripts.

When I wrote:

P.S.
Below my name on the left of your forum it says:
Manos
New Member.

But I have been a member since 2004, one of the first.
It would be better to write:
Old Member !!!

I was only joking.
But some people did not understand the spirit of my words.
If I were vain enough to want stars next to my name,
I could post good morning, good afternoon and good night in this forum every day.

Manos.

P.S.
Your previous forum in the U.K. was much faster.

anta40

Quote from: hutch-- on May 06, 2013, 05:25:12 PM
I have an example in mind, a hybrid sort of Robert Sedgewick originally written in C and it was genuinely fast.

Is the code listing available somewhere on the internet?
Or is it in his book, i.e. Algorithms in C?

Antariy

Hi Jochen :t

In that post I also noted that at first glance the compiler does a good job, but some small details seem to stem from outdated assumptions in its optimisation techniques. Probably, as Hutch said, this is just a case where the algo is already at the edge of its performance. Well, it was created right in front of your eyes, you remember :biggrin: So this is a very good example of Human vs compiler: the same algo => the same inner loop logic in HLL and ASM => the same inner loop code in the compiler-generated code => and STILL the Human can improve on the compiler's work for a particular task and/or hardware. And you have just demonstrated it :t

On my CPU the timings barely change (incredible!). I ran it multiple times; these are the smallest timings:

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)

43      cycles for Small 1
44      cycles for Small 2
43      cycles for Small 3
46      cycles for Small 3.1
43      cycles for Small 4
79      cycles for C version
79      cycles for C mod JJ

43      cycles for Small 1
44      cycles for Small 2
37      cycles for Small 3
47      cycles for Small 3.1
46      cycles for Small 4
79      cycles for C version
72      cycles for C mod JJ

43      cycles for Small 1
44      cycles for Small 2
43      cycles for Small 3
46      cycles for Small 3.1
46      cycles for Small 4
79      cycles for C version
79      cycles for C mod JJ


48       bytes for Axhex2dw_C2
43       bytes for Axhex2dw_CJ
ABCDEF01        returned
--- ok ---



But your CPU obviously does not like the longer, superfluous code generated by the compiler; it still prefers the Human's code :biggrin: It will be interesting to see how more modern CPUs run it.
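
Reporting the smallest of several runs, as done above, can be sketched in portable C as follows. This is only an illustration of the "best of N" idea: the testbed in this thread reports CPU cycles, whereas this sketch uses the standard clock() function and returns seconds. The names hex_loop and best_of_runs are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Stand-in workload: a looped hex conversion like the one discussed above. */
static unsigned int hex_loop(const char *s)
{
    unsigned int r = 0;
    for (; *s; s++)
        r = (r << 4) + (unsigned int)(((*s >> 6) * 9) + (*s & 0x0F));
    return r;
}

/* Time `iters` calls per sample, repeat for `samples` samples, and return the
   smallest elapsed time in seconds. Taking the minimum filters out samples
   that were disturbed by interrupts or task switches. */
double best_of_runs(unsigned int (*fn)(const char *), const char *arg,
                    int samples, long iters)
{
    volatile unsigned int sink = 0;     /* keeps the calls from being optimised away */
    double best = -1.0;
    int s;
    long i;

    assert(fn != NULL && samples > 0 && iters > 0);
    for (s = 0; s < samples; s++) {
        clock_t start = clock();
        for (i = 0; i < iters; i++)
            sink += fn(arg);
        double elapsed = (double)(clock() - start) / CLOCKS_PER_SEC;
        if (best < 0.0 || elapsed < best)
            best = elapsed;
    }
    (void)sink;
    return best;
}
```

A call such as best_of_runs(hex_loop, "ABCDEF01", 3, 1000000) gives one figure per routine; clock() is far too coarse to resolve a single call, which is why each sample loops many times.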

habran

Intel(R) Core(TM) i7-3610QM CPU @ 2.30GHz (SSE4)

19      cycles for Small 1
10      cycles for Small 2
39      cycles for Small 3
38      cycles for Small 3.1
19      cycles for Small 4
26      cycles for C version
19      cycles for C mod JJ

17      cycles for Small 1
17      cycles for Small 2
18      cycles for Small 3
18      cycles for Small 3.1
18      cycles for Small 4
26      cycles for C version
20      cycles for C mod JJ

16      cycles for Small 1
19      cycles for Small 2
19      cycles for Small 3
18      cycles for Small 3.1
18      cycles for Small 4
25      cycles for C version
19      cycles for C mod JJ


48       bytes for Axhex2dw_C2
43       bytes for Axhex2dw_CJ
ABCDEF01        returned
--- ok ---
Cod-Father

qWord

Behold and see...
Intel(R) Core(TM) i7-3610QM CPU @ 2.30GHz (SSE4)

17      cycles for Small 1
18      cycles for Small 2
19      cycles for Small 3
18      cycles for Small 3.1
19      cycles for Small 4
25      cycles for C version
19      cycles for C mod JJ
18      hsz2dw (Microsoft 32-Bit C/C++ optimization compiler v16.00.40219.01)
11      hsz2dw2 (unrolled 4 times)

17      cycles for Small 1
23      cycles for Small 2
16      cycles for Small 3
17      cycles for Small 3.1
23      cycles for Small 4
24      cycles for C version
19      cycles for C mod JJ
16      hsz2dw (Microsoft 32-Bit C/C++ optimization compiler v16.00.40219.01)
13      hsz2dw2 (unrolled 4 times)

20      cycles for Small 1
15      cycles for Small 2
19      cycles for Small 3
19      cycles for Small 3.1
19      cycles for Small 4
26      cycles for C version
19      cycles for C mod JJ
16      hsz2dw (Microsoft 32-Bit C/C++ optimization compiler v16.00.40219.01)
13      hsz2dw2 (unrolled 4 times)


48       bytes for Axhex2dw_C2
43       bytes for Axhex2dw_CJ
ABCDEF01        returned
--- ok ---

unsigned int hsz2dw2(char* psz){
const static unsigned char lut[256] = { 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
   ,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,2,3,4,5,6,7,8,9,0,0,0,0,0,0,0
   ,10,11,12,13,14,15,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
   ,10,11,12,13,14,15,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
   ,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
   ,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
   ,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
   ,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 };

    register unsigned int c;
    register unsigned int r = 0;
    register unsigned char* p = (unsigned char*) psz;

    while (1) {
        if (!(c = *p))
            break;
        r <<= 4;
        r += lut[c];
        p++;

        if (!(c = *p))
            break;
        r <<= 4;
        r += lut[c];
        p++;

        if (!(c = *p))
            break;
        r <<= 4;
        r += lut[c];
        p++;

        if (!(c = *p))
            break;
        r <<= 4;
        r += lut[c];
        p++;
    }
    return r;
}
:biggrin:
MREAL macros - when you need floating point arithmetic while assembling!
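
The source of the plain hsz2dw is not shown in this thread. Assuming it is the rolled counterpart of hsz2dw2, it might look like the sketch below, which also fills the table at run time instead of spelling out all 256 entries. The names hex_lut_init and hsz2dw_rolled are hypothetical, and the table filling assumes an ASCII-compatible character set.

```c
#include <assert.h>
#include <stddef.h>

static unsigned char hex_lut[256];      /* zero-initialised: non-hex characters map to 0 */
static int hex_lut_ready = 0;

/* Fill in the 22 meaningful entries: '0'..'9', 'A'..'F' and 'a'..'f'. */
static void hex_lut_init(void)
{
    unsigned i;

    for (i = 0; i < 10; i++)
        hex_lut['0' + i] = (unsigned char)i;
    for (i = 0; i < 6; i++) {
        hex_lut['A' + i] = (unsigned char)(10 + i);
        hex_lut['a' + i] = (unsigned char)(10 + i);
    }
    hex_lut_ready = 1;
}

/* One table lookup per digit, one loop iteration per digit (not unrolled). */
unsigned int hsz2dw_rolled(const char *psz)
{
    const unsigned char *p = (const unsigned char *)psz;
    unsigned int r = 0;

    assert(psz != NULL);                /* debug-only guard */
    if (!hex_lut_ready)
        hex_lut_init();
    for (; *p; p++)
        r = (r << 4) + hex_lut[*p];
    return r;
}
```

Unrolling this loop four times by hand gives the shape of hsz2dw2 above; whether that pays off depends on the CPU, as the timings in this thread show.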

jj2007

Refuses to optimise for AMD ;-)

AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
39      cycles for Small 1
42      cycles for Small 2
39      cycles for Small 3
39      cycles for Small 3.1
39      cycles for Small 4
46      cycles for C version
38      cycles for C mod JJ
47      hsz2dw (Microsoft 32-Bit C/C++ optimization compiler v16.00.40219.01)
43      hsz2dw2 (unrolled 4 times)

hutch--

 :biggrin:

> I know how forum software works, and I know how to program forum software with VBasic and Java scripts.

You would be surprised just how badly it would run on this 64 bit Unix server. The forum is written in PHP, not VBScript and Java, and NO, the forum software will not be modified.

FORTRANS

Hi,

   Three more data points for you.

Cheers,


 (SSE1)

61 cycles for Small 1
57 cycles for Small 2
59 cycles for Small 3
60 cycles for Small 3.1
60 cycles for Small 4
71 cycles for C version
65 cycles for C mod JJ
57 hsz2dw (Microsoft 32-Bit C/C++ optimization compiler v16.00.40219.01)
28 hsz2dw2 (unrolled 4 times)

60 cycles for Small 1
59 cycles for Small 2
59 cycles for Small 3
60 cycles for Small 3.1
62 cycles for Small 4
71 cycles for C version
63 cycles for C mod JJ
55 hsz2dw (Microsoft 32-Bit C/C++ optimization compiler v16.00.40219.01)
29 hsz2dw2 (unrolled 4 times)

60 cycles for Small 1
60 cycles for Small 2
59 cycles for Small 3
60 cycles for Small 3.1
60 cycles for Small 4
71 cycles for C version
63 cycles for C mod JJ
55 hsz2dw (Microsoft 32-Bit C/C++ optimization compiler v16.00.40219.01)
28 hsz2dw2 (unrolled 4 times)


48 bytes for Axhex2dw_C2
43 bytes for Axhex2dw_CJ
ABCDEF01 returned
--- ok ---

120 cycles for Small 1
143 cycles for Small 2
125 cycles for Small 3
132 cycles for Small 3.1
131 cycles for Small 4
117 cycles for C version
107 cycles for C mod JJ
101 hsz2dw (Microsoft 32-Bit C/C++ optimization compiler v16.00.40219.01)
100 hsz2dw2 (unrolled 4 times)

119 cycles for Small 1
141 cycles for Small 2
136 cycles for Small 3
125 cycles for Small 3.1
123 cycles for Small 4
117 cycles for C version
106 cycles for C mod JJ
102 hsz2dw (Microsoft 32-Bit C/C++ optimization compiler v16.00.40219.01)
100 hsz2dw2 (unrolled 4 times)

117 cycles for Small 1
139 cycles for Small 2
130 cycles for Small 3
125 cycles for Small 3.1
127 cycles for Small 4
117 cycles for C version
106 cycles for C mod JJ
102 hsz2dw (Microsoft 32-Bit C/C++ optimization compiler v16.00.40219.01)
100 hsz2dw2 (unrolled 4 times)


48 bytes for Axhex2dw_C2
43 bytes for Axhex2dw_CJ
ABCDEF01 returned
--- ok ---
Intel(R) Pentium(R) M processor 1.70GHz (SSE2)

41 cycles for Small 1
38 cycles for Small 2
45 cycles for Small 3
41 cycles for Small 3.1
41 cycles for Small 4
51 cycles for C version
56 cycles for C mod JJ
31 hsz2dw (Microsoft 32-Bit C/C++ optimization compiler v16.00.40219.01)
24 hsz2dw2 (unrolled 4 times)

41 cycles for Small 1
38 cycles for Small 2
41 cycles for Small 3
41 cycles for Small 3.1
41 cycles for Small 4
51 cycles for C version
50 cycles for C mod JJ
31 hsz2dw (Microsoft 32-Bit C/C++ optimization compiler v16.00.40219.01)
24 hsz2dw2 (unrolled 4 times)

41 cycles for Small 1
38 cycles for Small 2
41 cycles for Small 3
41 cycles for Small 3.1
41 cycles for Small 4
51 cycles for C version
50 cycles for C mod JJ
31 hsz2dw (Microsoft 32-Bit C/C++ optimization compiler v16.00.40219.01)
24 hsz2dw2 (unrolled 4 times)


48 bytes for Axhex2dw_C2
43 bytes for Axhex2dw_CJ
ABCDEF01 returned
--- ok ---

Antariy

Hi qWord :t

Yeah, unrolling is the way to make it faster, but the tested algo is not unrolled, and that's the point. It's "classic", more or less small, "looped" code, and these characteristics are intentional: this was not a contest but rather a test :biggrin:


Intel(R) Celeron(R) CPU 2.13GHz (SSE3)

43      cycles for Small 1
45      cycles for Small 2
43      cycles for Small 3
47      cycles for Small 3.1
43      cycles for Small 4
79      cycles for C version
79      cycles for C mod JJ
40      hsz2dw (Microsoft 32-Bit C/C++ optimization compiler v16.00.40219.01)
31      hsz2dw2 (unrolled 4 times)

43      cycles for Small 1
45      cycles for Small 2
66      cycles for Small 3
47      cycles for Small 3.1
43      cycles for Small 4
79      cycles for C version
79      cycles for C mod JJ
37      hsz2dw (Microsoft 32-Bit C/C++ optimization compiler v16.00.40219.01)
31      hsz2dw2 (unrolled 4 times)

43      cycles for Small 1
45      cycles for Small 2
43      cycles for Small 3
47      cycles for Small 3.1
43      cycles for Small 4
79      cycles for C version
79      cycles for C mod JJ
37      hsz2dw (Microsoft 32-Bit C/C++ optimization compiler v16.00.40219.01)
31      hsz2dw2 (unrolled 4 times)


48       bytes for Axhex2dw_C2
43       bytes for Axhex2dw_CJ
ABCDEF01        returned
--- ok ---



Actually, the testbed I posted was a trimmed version of one I made some time ago... I will post it now. It contains the procs that were in the contest here earlier. I used Axhex2dw just because it is the fastest (at least so far) of the tested hex2dw procs with these characteristics: case insensitive, does not check input, is looped (i.e. one loop iteration for every digit, not unrolled at all). And it looks like it is copyrighted by me; at least no one has disputed the rights for ~3 years :lol:

OK, here are the timings for the attached archive (it is the old testbed):


Intel(R) Celeron(R) CPU 2.13GHz (SSE3)



25      cycles for Fast version
27      cycles for Fast version under AMD
43      cycles for Small 1
45      cycles for Small 2
43      cycles for Small 3
47      cycles for Small 3.1
43      cycles for Small 4
28      cycles for MMX 1
28      cycles for MMX 2
32      cycles for SSE1

Other's Versions:
48      cycles for Axhex2dw improved by Hutch (1)
83      cycles for Axhex2dw improved by Hutch (2)

28      cycles for Lingo's SSE version
24      cycles for Lingo's BIG integer version
23      cycles for Jochen's WORD-Indexed version
27      cycles for Dave's version (with minor changes)


25      cycles for Fast version
27      cycles for Fast version under AMD
43      cycles for Small 1
45      cycles for Small 2
43      cycles for Small 3
59      cycles for Small 3.1
43      cycles for Small 4
28      cycles for MMX 1
28      cycles for MMX 2
30      cycles for SSE1

Other's Versions:
48      cycles for Axhex2dw improved by Hutch (1)
83      cycles for Axhex2dw improved by Hutch (2)

28      cycles for Lingo's SSE version
24      cycles for Lingo's BIG integer version
23      cycles for Jochen's WORD-Indexed version
27      cycles for Dave's version (with minor changes)


25      cycles for Fast version
30      cycles for Fast version under AMD
43      cycles for Small 1
116     cycles for Small 2
43      cycles for Small 3
47      cycles for Small 3.1
43      cycles for Small 4
28      cycles for MMX 1
28      cycles for MMX 2
46      cycles for SSE1

Other's Versions:
48      cycles for Axhex2dw improved by Hutch (1)
83      cycles for Axhex2dw improved by Hutch (2)

28      cycles for Lingo's SSE version
24      cycles for Lingo's BIG integer version
23      cycles for Jochen's WORD-Indexed version
27      cycles for Dave's version (with minor changes)

==========
Codesizes:
Axhex2dw_Unrolled:      396
Axhex2dw_Unrolled_AMD:  396
Axhex2dw1 - 1:  69
Axhex2dw2 - 2:  48
Axhex2dw3 - 3:  57
Axhex2dw3_1 - 3.1:      56
Axhex2dw3 - 4:  61
Axhex2dw_MMX:   128
Axhex2dw_MMX2:  160
Axhex2dw_SSE:   160
Alex_Short_Hutch:       59
Axhex2dw_Hutch2:        54
Hex2dwLingoSSE: 160
lingo_htodw:    1950
ax_jj_htodw:    174
krbhtodw:       547
--- ok ---


krbhtodw - the author is Dave (KeepingRealBusy), with minor changes made with his permission. It's the most universal proc: it checks the input and can handle "ignorant chars". It is lookup-table based.
The fastest GPR code is by Jochen (jj2007) - ax_jj_htodw - it uses a word-indexed lookup table.

All versions not listed under "Other's versions" are mine, but when I posted in this thread I excluded every non-GPR, every unrolled and every lookup-table-based version. Well, new CPUs have been released since then, so maybe it's interesting to test all these procs again :biggrin:

Antariy

BTW: Jochen's code is 174 bytes long AND its lookup table is initialized once at run time and does not take up space in the EXE, so it's not only the fastest but also the smallest of the unrolled versions (the size figure includes the hex2dw code plus the table initialization code).
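
The idea of a word-indexed table that is initialized once can be sketched in C as below. This is not Jochen's code (ax_jj_htodw is not listed in this thread), only an illustration under stated assumptions: two characters are combined into a 16-bit index, so one lookup yields a whole byte of the result, and the table is an uninitialised static array, which typically lives in .bss and adds no bytes to the executable. The names pair_lut, pair_lut_init and hex2dw_pairs are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint8_t pair_lut[65536];         /* indexed by first_char | (second_char << 8) */
static int pair_lut_ready = 0;

/* Value of a single valid hex digit. */
static unsigned digit_value(unsigned c)
{
    return (c >> 6) * 9 + (c & 0x0F);
}

/* Fill the entries for all 22 x 22 pairs of valid hex characters. */
static void pair_lut_init(void)
{
    static const char digits[] = "0123456789ABCDEFabcdef";
    size_t i, j;

    for (i = 0; digits[i]; i++) {
        for (j = 0; digits[j]; j++) {
            unsigned first = (unsigned char)digits[i];
            unsigned second = (unsigned char)digits[j];
            pair_lut[first | (second << 8)] =
                (uint8_t)((digit_value(first) << 4) | digit_value(second));
        }
    }
    pair_lut_ready = 1;
}

/* Convert two digits per lookup; a trailing single digit is handled separately. */
uint32_t hex2dw_pairs(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    uint32_t r = 0;

    assert(s != NULL);                  /* debug-only guard */
    if (!pair_lut_ready)
        pair_lut_init();
    while (p[0]) {
        if (!p[1]) {                    /* odd number of digits: one left over */
            r = (r << 4) + digit_value(p[0]);
            break;
        }
        r = (r << 8) | pair_lut[p[0] | (p[1] << 8)];
        p += 2;
    }
    return r;
}
```

The index is built byte by byte so the sketch does not depend on endianness; an assembly version on x86 would more likely load the two characters with a single 16-bit read.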

qWord

OK - I obviously missed the "spirit" of this thread...
MREAL macros - when you need floating point arithmetic while assembling!

Gunther

Hi Alex,

here are the timings for your 32Alex's_hex2dw.exe:

Quote
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)



26   cycles for Fast version
29   cycles for Fast version under AMD
50   cycles for Small 1
49   cycles for Small 2
51   cycles for Small 3
54   cycles for Small 3.1
51   cycles for Small 4
14   cycles for MMX 1
15   cycles for MMX 2
15   cycles for SSE1

Other's Versions:
54   cycles for Axhex2dw improved by Hutch (1)
62   cycles for Axhex2dw improved by Hutch (2)

10   cycles for Lingo's SSE version
19   cycles for Lingo's BIG integer version
17   cycles for Jochen's WORD-Indexed version
35   cycles for Dave's version (with minor changes)


29   cycles for Fast version
32   cycles for Fast version under AMD
48   cycles for Small 1
52   cycles for Small 2
54   cycles for Small 3
54   cycles for Small 3.1
50   cycles for Small 4
11   cycles for MMX 1
15   cycles for MMX 2
15   cycles for SSE1

Other's Versions:
65   cycles for Axhex2dw improved by Hutch (1)
60   cycles for Axhex2dw improved by Hutch (2)

10   cycles for Lingo's SSE version
19   cycles for Lingo's BIG integer version
17   cycles for Jochen's WORD-Indexed version
34   cycles for Dave's version (with minor changes)


29   cycles for Fast version
31   cycles for Fast version under AMD
48   cycles for Small 1
53   cycles for Small 2
54   cycles for Small 3
54   cycles for Small 3.1
54   cycles for Small 4
14   cycles for MMX 1
15   cycles for MMX 2
15   cycles for SSE1

Other's Versions:
55   cycles for Axhex2dw improved by Hutch (1)
62   cycles for Axhex2dw improved by Hutch (2)

10   cycles for Lingo's SSE version
19   cycles for Lingo's BIG integer version
17   cycles for Jochen's WORD-Indexed version
34   cycles for Dave's version (with minor changes)

==========
Codesizes:
Axhex2dw_Unrolled:   396
Axhex2dw_Unrolled_AMD:   396
Axhex2dw1 - 1:   69
Axhex2dw2 - 2:   48
Axhex2dw3 - 3:   57
Axhex2dw3_1 - 3.1:   56
Axhex2dw3 - 4:   61
Axhex2dw_MMX:   128
Axhex2dw_MMX2:   160
Axhex2dw_SSE:   160
Alex_Short_Hutch:   59
Axhex2dw_Hutch2:   54
Hex2dwLingoSSE:   160
lingo_htodw:   1950
ax_jj_htodw:   174
krbhtodw:   547
--- ok ---
You have to know the facts before you can distort them.

dedndave

Quote from: qWord on May 07, 2013, 12:53:42 AM
OK - I obviously missed the "spirit" of this thread...

:biggrin:  as if the subject has never come up before


jj2007

Quote from: qWord on May 07, 2013, 12:53:42 AM
OK - I obviously missed the "spirit" of this thread...

Hey, your code was actually quite good. Even if your 'piler refuses to optimise for my AMD ;)