News:

Masm32 SDK description, downloads and other helpful links
Message to All Guests

Main Menu

C/C++ vs Assembler

Started by Manos, May 04, 2013, 04:11:50 AM

Previous topic - Next topic

hutch--

This is the version of "strlen" in my XP version of MSVCRT. It does not look like compiler generated code and it is about 2 years after Agner Fog designed his StrLen algo which is in the MASM32 library.



  strlen:
    mov ecx, [esp+4]
    test ecx, 3
    jz lbl1

  lbl0:
    mov al, [ecx]
    inc ecx
    test al, al
    jz lbl2
    test ecx, 3
    jnz lbl0
    add eax, 0

  lbl1:
    mov eax, [ecx]
    mov edx, 7EFEFEFFh
    add edx, eax
    xor eax, 0FFFFFFFFh
    xor eax, edx
    add ecx, 4
    test eax, 81010100h
    jz lbl1
    mov eax, [ecx-4]
    test al, al
    jz lbl5
    test ah, ah
    jz lbl4
    test eax, 0FF0000h
    jz lbl3
    test eax, 0FF000000h
    jz lbl2
    jmp lbl1

  lbl2:
    lea eax, [ecx-1]
    mov ecx, [esp+4]
    sub eax, ecx
    ret

  lbl3:
    lea eax, [ecx-2]
    mov ecx, [esp+4]
    sub eax, ecx
    ret

  lbl4:
    lea eax, [ecx-3]
    mov ecx, [esp+4]
    sub eax, ecx
    ret

  lbl5:
    lea eax, [ecx-4]
    mov ecx, [esp+4]
    sub eax, ecx
    ret

Gunther

Manos,

Quote from: Manos on May 05, 2013, 10:41:42 PM
Which crt_strlen you used ?

Manos.

Jochen has the source included. I had to repeat the tests again under my 32 bit XP under VirtualPC. Here are the results:

Quote
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
+19 of 20 tests valid, loop overhead is approx. 37/10 cycles
1855   cycles for 10 * lstrlen
548   cycles for 10 * StrLen
608   cycles for 10 * crt_strlen
165   cycles for 10 * Len

2492   cycles for 10 * lstrlen
555   cycles for 10 * StrLen
594   cycles for 10 * crt_strlen
164   cycles for 10 * Len

2483   cycles for 10 * lstrlen
547   cycles for 10 * StrLen
606   cycles for 10 * crt_strlen
452   cycles for 10 * Len

100   = eax lstrlen
100   = eax StrLen
100   = eax crt_strlen
100   = eax Len

--- ok ---

Gunther
You have to know the facts before you can distort them.

Manos

Quote from: hutch-- on May 05, 2013, 10:51:10 PM
This is the version of "strlen" in my XP version of MSVCRT. It does not look like compiler generated code and it is about 2 years after Agner Fog designed his StrLen algo which is in the MASM32 library.

Steve,
where is your strlen version ?
I searched this in your library doday, before do my last post for testing, but not found.

The follow is the MS crt version for Intel:

        page    ,132
        title   strlen - return the length of a null-terminated string
;***
;strlen.asm - contains strlen() routine
;
;       Copyright (c) 1985-1997, Microsoft Corporation. All rights reserved.
;
;Purpose:
;       strlen returns the length of a null-terminated string,
;       not including the null byte itself.
;
;*******************************************************************************

        .xlist
        include cruntime.inc
        .list

page
;***
;strlen - return the length of a null-terminated string
;
;Purpose:
;       Finds the length in bytes of the given string, not including
;       the final null character.
;
;       Algorithm:
;       int strlen (const char * str)
;       {
;           int length = 0;
;
;           while( *str++ )
;                   ++length;
;
;           return( length );
;       }
;
;Entry:
;       const char * str - string whose length is to be computed
;
;Exit:
;       EAX = length of the string "str", exclusive of the final null byte
;
;Uses:
;       EAX, ECX, EDX
;
;Exceptions:
;
;*******************************************************************************

        CODESEG

        public  strlen

strlen  proc

        .FPO    ( 0, 1, 0, 0, 0, 0 )

string  equ     [esp + 4]

        mov     ecx,string              ; ecx -> string
        test    ecx,3                   ; test if string is aligned on 32 bits
        je      short main_loop

str_misaligned:
        ; simple byte loop until string is aligned
        mov     al,byte ptr [ecx]
        inc     ecx
        test    al,al
        je      short byte_3
        test    ecx,3
        jne     short str_misaligned

        add     eax,dword ptr 0         ; 5 byte nop to align label below

        align   16                      ; should be redundant

main_loop:
        mov     eax,dword ptr [ecx]     ; read 4 bytes
        mov     edx,7efefeffh
        add     edx,eax
        xor     eax,-1
        xor     eax,edx
        add     ecx,4
        test    eax,81010100h
        je      short main_loop
        ; found zero byte in the loop
        mov     eax,[ecx - 4]
        test    al,al                   ; is it byte 0
        je      short byte_0
        test    ah,ah                   ; is it byte 1
        je      short byte_1
        test    eax,00ff0000h           ; is it byte 2
        je      short byte_2
        test    eax,0ff000000h          ; is it byte 3
        je      short byte_3
        jmp     short main_loop         ; taken if bits 24-30 are clear and bit
                                        ; 31 is set

byte_3:
        lea     eax,[ecx - 1]
        mov     ecx,string
        sub     eax,ecx
        ret
byte_2:
        lea     eax,[ecx - 2]
        mov     ecx,string
        sub     eax,ecx
        ret
byte_1:
        lea     eax,[ecx - 3]
        mov     ecx,string
        sub     eax,ecx
        ret
byte_0:
        lea     eax,[ecx - 4]
        mov     ecx,string
        sub     eax,ecx
        ret

strlen  endp

        end


Manos.

P.S.
Below my name on the left of your forum writes:
Manos
New Member.

But I am one of the first members since 2004.
It would be better to write:
Old Member !!!

hutch--

Manos,

Same plce its always been, strlen.asm in the m32lib directory. It is the version written in 1996 by Agner Fog.

jj2007

Quote from: hutch-- on May 05, 2013, 10:51:10 PM
This is the version of "strlen" in my XP version of MSVCRT.

OK, included as crt_strlen2 (identical with Manos' "MS crt version for Intel"):
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
loop overhead is approx. 17/10 cycles

2677    cycles for 10 * lstrlen
973     cycles for 10 * StrLen
1484    cycles for 10 * crt_strlen
372     cycles for 10 * Len
1348    cycles for 10 * crt_strlen2

2682    cycles for 10 * lstrlen
973     cycles for 10 * StrLen
1464    cycles for 10 * crt_strlen
372     cycles for 10 * Len
1346    cycles for 10 * crt_strlen2

Gunther

Manos,

Quote from: Manos on May 06, 2013, 01:08:34 AM
P.S.
Below my name on the left of your forum writes:
Manos
New Member.

But I am one of the first members since 2004.
It would be better to write:
Old Member !!!

that's right, but it has to do with the number of posts you've made. It's the forum software.

Gunther
You have to know the facts before you can distort them.

Manos

Quote from: Gunther on May 06, 2013, 03:29:06 AM
that's right, but it has to do with the number of posts you've made. It's the forum software.

Gunther

Yes, you are right, but in some forums Administrator can change this characteristic.
For example, in my forum I have in Admin control panel  this:

Manage groups
From this panel you can administer all your usergroups. You can delete, create and edit existing groups. Furthermore, you may choose group leaders, toggle open/hidden/closed group status and set the group name and description.

User defined groups
These are groups created by you or another admin on this board. You can manage memberships as well as edit group properties or even delete the group.


Manos.

Gunther

Manos,

Quote from: Manos on May 06, 2013, 04:22:14 AM
Yes, you are right, but in some forums Administrator can change this characteristic.

that's clear. On the other hand, writing some more posts and being active in the forum isn't so hard. We're a very lively forum, but it's always good to have experienced and hard working coders like you on the side.  :t

Gunther
You have to know the facts before you can distort them.

Antariy

OK, here is test for more or less sophisticated integer code (some of us remember this piece :biggrin:)

The C code is in Axhex2dw.c file, it linked with an ASM source with timing testbed:


INT_PTR __stdcall Axhex2dw_C(char* ptc){
INT_PTR result=0;
while(*ptc){
result=((result<<4)+(*ptc&0xF)+((*ptc>>6)*9));
++ptc;
}
return result;
}


The C code was build with MSVC10, with different optimization settings, the .OBJ file included into archive is one with maximal optimization for performance.

OK, here is the disassembly for the C code optimized to be small:


_Axhex2dw_C@4:
  00000000: 8B 54 24 04        mov         edx,dword ptr [esp+4]
  00000004: 8A 0A              mov         cl,byte ptr [edx]
  00000006: 33 C0              xor         eax,eax
  00000008: 84 C9              test        cl,cl
  0000000A: 74 1E              je          0000002A
  0000000C: 56                 push        esi
  0000000D: 0F BE C9           movsx       ecx,cl
  00000010: 8B F1              mov         esi,ecx
  00000012: C1 FE 06           sar         esi,6
  00000015: 6B F6 09           imul        esi,esi,9
  00000018: 83 E1 0F           and         ecx,0Fh
  0000001B: 03 F1              add         esi,ecx
  0000001D: C1 E0 04           shl         eax,4
  00000020: 03 C6              add         eax,esi
  00000022: 42                 inc         edx
  00000023: 8A 0A              mov         cl,byte ptr [edx]
  00000025: 84 C9              test        cl,cl
  00000027: 75 E4              jne         0000000D
  00000029: 5E                 pop         esi
  0000002A: C2 04 00           ret         4


45 bytes long.

The timings:


Intel(R) Celeron(R) CPU 2.13GHz (SSE3)

43      cycles for Small 1
44      cycles for Small 2
43      cycles for Small 3
46      cycles for Small 3.1
43      cycles for Small 4
100     cycles for C version
51      cycles for Small 1
44      cycles for Small 2
43      cycles for Small 3
47      cycles for Small 3.1
43      cycles for Small 4
99      cycles for C version
43      cycles for Small 1
45      cycles for Small 2
43      cycles for Small 3
46      cycles for Small 3.1
43      cycles for Small 4
99      cycles for C version

--- ok ---


Axhex2dw1 (the "Small 1") is 69 bytes long, Axhex2dw2 (the "Small 2") is 48 bytes long.




OK, now the test with maximum performance optimization for C code.

The disassembly:


_Axhex2dw_C@4:
  00000000: 56                 push        esi
  00000001: 8B 74 24 08        mov         esi,dword ptr [esp+8]
  00000005: 8A 0E              mov         cl,byte ptr [esi]
  00000007: 33 C0              xor         eax,eax
  00000009: 84 C9              test        cl,cl
  0000000B: 74 20              je          0000002D
  0000000D: 8D 49 00           lea         ecx,[ecx]
  00000010: 0F BE C9           movsx       ecx,cl
  00000013: 8B D1              mov         edx,ecx
  00000015: C1 FA 06           sar         edx,6
  00000018: 83 E1 0F           and         ecx,0Fh
  0000001B: 8D 14 D2           lea         edx,[edx+edx*8]
  0000001E: 03 D1              add         edx,ecx
  00000020: 8A 4E 01           mov         cl,byte ptr [esi+1]
  00000023: 46                 inc         esi
  00000024: C1 E0 04           shl         eax,4
  00000027: 03 C2              add         eax,edx
  00000029: 84 C9              test        cl,cl
  0000002B: 75 E3              jne         00000010
  0000002D: 5E                 pop         esi
  0000002E: C2 04 00           ret         4



49 bytes long. You may see that the compiler used LEA to multiply EDX by 9 - the algo in C was intentionally written in such a way that optimizing compiler will produce "at first look" the code similar to the handwritten code, i.e. it was written speed-optimized already in HLL, and you can see that inner loop logic is the same as in the handwritten code, so, this time the timings for the C code should probably be very good, but...


Intel(R) Celeron(R) CPU 2.13GHz (SSE3)

46      cycles for Small 1
45      cycles for Small 2
43      cycles for Small 3
47      cycles for Small 3.1
51      cycles for Small 4
79      cycles for C version
43      cycles for Small 1
44      cycles for Small 2
43      cycles for Small 3
50      cycles for Small 3.1
43      cycles for Small 4
86      cycles for C version
43      cycles for Small 1
45      cycles for Small 2
43      cycles for Small 3
46      cycles for Small 3.1
43      cycles for Small 4
79      cycles for C version

--- ok ---





The logic is the same (of course, algo is the same), but the implementation is different. So, like I said in my first message in the thread, sophisticated (even this one, that's not really too sophisticated) algos aren't the best things that even optimizing compiler can produce well. Also some things like grabbing a char, then promoting it to dword with two instructions instead of one, then the strange compiler's fear to get the same byte twice from a memory reference, etc - these things make algo slower. Well, being the program, the compiler "writes" very good code, we should agree with that :biggrin: It's really big and hard people's work behind the short description "Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.30319.01 for 80x86" :eusa_clap:


It will be very interesting to see the results on different machines, as this time this is also the test "for which machines does Microsoft optimize their compilers?" - i.e. on which machines VC's code performs better - so we can say "on this machine Windows (Word/Photoshop/etc) works faster than on that (with equal CPU freq)! Just because it uses MSVC" :lol:

hutch--

You don't have to be a genius to know that this forum is DIFFERENT to the last one that was hosted in the UK. After a lot of work I have made an archived version of the old forum available but everybody who is a member of this forum started with a zero post count, me included as the administrator. I did not write the software and nor do I care who likes it or not, it does the job and it is not going to be modified to suit a quirk of that few folks who have not done the work to build a new forum and archive the old one.

The advice has already been given by more active members, make some more posts.

jj2007

#40
Quote from: Antariy on May 06, 2013, 01:23:07 PM
It will be very interesting to see the results on different machines

Hi Alex,
On paper we have the same machine but results are different:

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
41      cycles for Small 1
38      cycles for Small 2
41      cycles for Small 3
41      cycles for Small 3.1
41      cycles for Small 4
75      cycles for C version
41      cycles for Small 1
49      cycles for Small 2
41      cycles for Small 3
41      cycles for Small 3.1
41      cycles for Small 4
50      cycles for C version
41      cycles for Small 1
38      cycles for Small 2
53      cycles for Small 3
41      cycles for Small 3.1
41      cycles for Small 4
51      cycles for C version


The 75 cycles peak is not an accident, it's there for every run I tried.

AMD Athlon(tm) Dual Core Processor 4450B (SSE3)

39      cycles for Small 1
42      cycles for Small 2
39      cycles for Small 3
39      cycles for Small 3.1
39      cycles for Small 4
44      cycles for C version

habran


Intel(R) Core(TM) i7-3610QM CPU @ 2.30GHz (SSE4)

15      cycles for Small 1
16      cycles for Small 2
16      cycles for Small 3
39      cycles for Small 3.1
14      cycles for Small 4
26      cycles for C version
17      cycles for Small 1
16      cycles for Small 2
16      cycles for Small 3
16      cycles for Small 3.1
19      cycles for Small 4
26      cycles for C version
16      cycles for Small 1
17      cycles for Small 2
18      cycles for Small 3
19      cycles for Small 3.1
18      cycles for Small 4
26      cycles for C version

--- ok ---
Cod-Father

sinsi

I wonder how much influence the OS has in these timings too, like whether running 32-bit code on a 64-bit OS has a penalty?


Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz (SSE4)

22      cycles for Small 1
23      cycles for Small 2
24      cycles for Small 3
24      cycles for Small 3.1
24      cycles for Small 4
28      cycles for C version
23      cycles for Small 1
24      cycles for Small 2
20      cycles for Small 3
19      cycles for Small 3.1
19      cycles for Small 4
34      cycles for C version
20      cycles for Small 1
20      cycles for Small 2
20      cycles for Small 3
20      cycles for Small 3.1
20      cycles for Small 4
33      cycles for C version


Antariy

Hi, Jochen :biggrin:

I think it's maybe because of first-time pass - code cache did not probably purge and in next tests at the same run your CPU has already ready "food". And it seem that at first time the code with superfluous instructions and not best arranged logic takes much longer to be decoded and executed, than next time it just executed from the code cache. But, speaking of "real life"(tm) app, this means that this code is crazly unoptimal (since in real app it will not be called millions of times consequently, so, the best code is that code which also gets decoded faster... but, again, what means +/- 30 clocks if the proc will be called, let's say, once in a second... that's the true reason why compilers are so popular for "general programming"... but, but, again, the entire proggie consists from these small bits and if every of it will be a bit faster, entire proggie will be faster... well, philosophy is starting here :greensml:)

It's interesting also how different tweaks of the assembly versions of Axhex2dw perform. "Small 1" (you remember it, of course :biggrin:) still looks like more or less stable in the timings. And, even being longer than C version in therms of code size, it still faster.

Hi, habran, thanks for testing it :t

Hi, John, thank you, too :biggrin: As for influence of the OS - I think you're right, and switching to a 32 bit execution context has a penalty under 64 bit OS, like it is for 16 bit apps under 32 bit OS (but since my x64 experience is small - I cannot say it 100%).

hutch--

There is an old rule that occasionally applies, if you get a good enough algorithm, then the coding does not matter as much. I have an example in mind, a hybrid sort of Robert Sedgewick originally written in C and it was genuinely fast. I used it as an algorithm to test out a tool I was designing and converted it directly to unoptimised assembler, removed the stack frame, dropped the instruction count, inlined all of the satellite functions and it would not go faster than the C original.

I got a lower better optimised instruction sequence and it stubbornly refused to go faster. It does not happen all that often but it was interesting to see. Basically a perfect algorithm that was highly insensitive to coding technique.