C/C++ vs Assembler

hutch-- · May 05, 2013, 10:51:10 PM

This is the version of "strlen" in my XP version of MSVCRT. It does not look like compiler generated code and it is about 2 years after Agner Fog designed his StrLen algo which is in the MASM32 library.

strlen:
mov ecx, [esp+4]
test ecx, 3
jz lbl1

lbl0:
mov al, [ecx]
inc ecx
test al, al
jz lbl2
test ecx, 3
jnz lbl0
add eax, 0

lbl1:
mov eax, [ecx]
mov edx, 7EFEFEFFh
add edx, eax
xor eax, 0FFFFFFFFh
xor eax, edx
add ecx, 4
test eax, 81010100h
jz lbl1
mov eax, [ecx-4]
test al, al
jz lbl5
test ah, ah
jz lbl4
test eax, 0FF0000h
jz lbl3
test eax, 0FF000000h
jz lbl2
jmp lbl1

lbl2:
lea eax, [ecx-1]
mov ecx, [esp+4]
sub eax, ecx
ret

lbl3:
lea eax, [ecx-2]
mov ecx, [esp+4]
sub eax, ecx
ret

lbl4:
lea eax, [ecx-3]
mov ecx, [esp+4]
sub eax, ecx
ret

lbl5:
lea eax, [ecx-4]
mov ecx, [esp+4]
sub eax, ecx
ret

Gunther · May 06, 2013, 12:16:47 AM

Manos,

Quote from: Manos on May 05, 2013, 10:41:42 PM
Which crt_strlen you used ?

Manos.

Jochen has the source included. I had to repeat the tests again under my 32 bit XP under VirtualPC. Here are the results:

Quote
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
+19 of 20 tests valid, loop overhead is approx. 37/10 cycles
1855   cycles for 10 * lstrlen
548   cycles for 10 * StrLen
608   cycles for 10 * crt_strlen
165   cycles for 10 * Len

2492   cycles for 10 * lstrlen
555   cycles for 10 * StrLen
594   cycles for 10 * crt_strlen
164   cycles for 10 * Len

2483   cycles for 10 * lstrlen
547   cycles for 10 * StrLen
606   cycles for 10 * crt_strlen
452   cycles for 10 * Len

100   = eax lstrlen
100   = eax StrLen
100   = eax crt_strlen
100   = eax Len

--- ok ---

Gunther

Manos · May 06, 2013, 01:08:34 AM

Quote from: hutch-- on May 05, 2013, 10:51:10 PM
This is the version of "strlen" in my XP version of MSVCRT. It does not look like compiler generated code and it is about 2 years after Agner Fog designed his StrLen algo which is in the MASM32 library.

Steve,
where is your strlen version ?
I searched this in your library doday, before do my last post for testing, but not found.

The follow is the MS crt version for Intel:

page ,132
title strlen - return the length of a null-terminated string
;***
;strlen.asm - contains strlen() routine
;
; Copyright (c) 1985-1997, Microsoft Corporation. All rights reserved.
;
;Purpose:
; strlen returns the length of a null-terminated string,
; not including the null byte itself.
;
;*******************************************************************************

.xlist
include cruntime.inc
.list

page
;***
;strlen - return the length of a null-terminated string
;
;Purpose:
; Finds the length in bytes of the given string, not including
; the final null character.
;
; Algorithm:
; int strlen (const char * str)
; {
; int length = 0;
;
; while( *str++ )
; ++length;
;
; return( length );
; }
;
;Entry:
; const char * str - string whose length is to be computed
;
;Exit:
; EAX = length of the string "str", exclusive of the final null byte
;
;Uses:
; EAX, ECX, EDX
;
;Exceptions:
;
;*******************************************************************************

CODESEG

public strlen

strlen proc

.FPO ( 0, 1, 0, 0, 0, 0 )

string equ [esp + 4]

mov ecx,string ; ecx -> string
test ecx,3 ; test if string is aligned on 32 bits
je short main_loop

str_misaligned:
; simple byte loop until string is aligned
mov al,byte ptr [ecx]
inc ecx
test al,al
je short byte_3
test ecx,3
jne short str_misaligned

add eax,dword ptr 0 ; 5 byte nop to align label below

align 16 ; should be redundant

main_loop:
mov eax,dword ptr [ecx] ; read 4 bytes
mov edx,7efefeffh
add edx,eax
xor eax,-1
xor eax,edx
add ecx,4
test eax,81010100h
je short main_loop
; found zero byte in the loop
mov eax,[ecx - 4]
test al,al ; is it byte 0
je short byte_0
test ah,ah ; is it byte 1
je short byte_1
test eax,00ff0000h ; is it byte 2
je short byte_2
test eax,0ff000000h ; is it byte 3
je short byte_3
jmp short main_loop ; taken if bits 24-30 are clear and bit
; 31 is set

byte_3:
lea eax,[ecx - 1]
mov ecx,string
sub eax,ecx
ret
byte_2:
lea eax,[ecx - 2]
mov ecx,string
sub eax,ecx
ret
byte_1:
lea eax,[ecx - 3]
mov ecx,string
sub eax,ecx
ret
byte_0:
lea eax,[ecx - 4]
mov ecx,string
sub eax,ecx
ret

strlen endp

end

Manos.

P.S.
Below my name on the left of your forum writes:
Manos
New Member.

But I am one of the first members since 2004.
It would be better to write:
Old Member !!!

hutch-- · May 06, 2013, 01:14:07 AM

Manos,

Same plce its always been, strlen.asm in the m32lib directory. It is the version written in 1996 by Agner Fog.

jj2007 · May 06, 2013, 02:52:17 AM

Quote from: hutch-- on May 05, 2013, 10:51:10 PM
This is the version of "strlen" in my XP version of MSVCRT.

OK, included as crt_strlen2 (identical with Manos' "MS crt version for Intel"):
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
loop overhead is approx. 17/10 cycles

2677 cycles for 10 * lstrlen
973 cycles for 10 * StrLen
1484 cycles for 10 * crt_strlen
372 cycles for 10 * Len
1348 cycles for 10 * crt_strlen2

2682 cycles for 10 * lstrlen
973 cycles for 10 * StrLen
1464 cycles for 10 * crt_strlen
372 cycles for 10 * Len
1346 cycles for 10 * crt_strlen2

Gunther · May 06, 2013, 03:29:06 AM

Manos,

Quote from: Manos on May 06, 2013, 01:08:34 AM
P.S.
Below my name on the left of your forum writes:
Manos
New Member.

But I am one of the first members since 2004.
It would be better to write:
Old Member !!!

that's right, but it has to do with the number of posts you've made. It's the forum software.

Gunther

Manos · May 06, 2013, 04:22:14 AM

Quote from: Gunther on May 06, 2013, 03:29:06 AM
that's right, but it has to do with the number of posts you've made. It's the forum software.

Gunther

Yes, you are right, but in some forums Administrator can change this characteristic.
For example, in my forum I have in Admin control panel this:

Manage groups
From this panel you can administer all your usergroups. You can delete, create and edit existing groups. Furthermore, you may choose group leaders, toggle open/hidden/closed group status and set the group name and description.

User defined groups
These are groups created by you or another admin on this board. You can manage memberships as well as edit group properties or even delete the group.

Manos.

Gunther · May 06, 2013, 07:49:46 AM

Manos,

Quote from: Manos on May 06, 2013, 04:22:14 AM
Yes, you are right, but in some forums Administrator can change this characteristic.

that's clear. On the other hand, writing some more posts and being active in the forum isn't so hard. We're a very lively forum, but it's always good to have experienced and hard working coders like you on the side. :t

Gunther

Antariy · May 06, 2013, 01:23:07 PM

OK, here is test for more or less sophisticated integer code (some of us remember this piece

)

The C code is in Axhex2dw.c file, it linked with an ASM source with timing testbed:

Code Select


INT_PTR __stdcall Axhex2dw_C(char* ptc){
	INT_PTR result=0;
	while(*ptc){
		result=((result<<4)+(*ptc&0xF)+((*ptc>>6)*9));
		++ptc;
	}
	return result;
}

The C code was build with MSVC10, with different optimization settings, the .OBJ file included into archive is one with maximal optimization for performance.

OK, here is the disassembly for the C code optimized to be small:

Code Select


_Axhex2dw_C@4:
  00000000: 8B 54 24 04        mov         edx,dword ptr [esp+4]
  00000004: 8A 0A              mov         cl,byte ptr [edx]
  00000006: 33 C0              xor         eax,eax
  00000008: 84 C9              test        cl,cl
  0000000A: 74 1E              je          0000002A
  0000000C: 56                 push        esi
  0000000D: 0F BE C9           movsx       ecx,cl
  00000010: 8B F1              mov         esi,ecx
  00000012: C1 FE 06           sar         esi,6
  00000015: 6B F6 09           imul        esi,esi,9
  00000018: 83 E1 0F           and         ecx,0Fh
  0000001B: 03 F1              add         esi,ecx
  0000001D: C1 E0 04           shl         eax,4
  00000020: 03 C6              add         eax,esi
  00000022: 42                 inc         edx
  00000023: 8A 0A              mov         cl,byte ptr [edx]
  00000025: 84 C9              test        cl,cl
  00000027: 75 E4              jne         0000000D
  00000029: 5E                 pop         esi
  0000002A: C2 04 00           ret         4

45 bytes long.

The timings:

Code Select


Intel(R) Celeron(R) CPU 2.13GHz (SSE3)

43      cycles for Small 1
44      cycles for Small 2
43      cycles for Small 3
46      cycles for Small 3.1
43      cycles for Small 4
100     cycles for C version
51      cycles for Small 1
44      cycles for Small 2
43      cycles for Small 3
47      cycles for Small 3.1
43      cycles for Small 4
99      cycles for C version
43      cycles for Small 1
45      cycles for Small 2
43      cycles for Small 3
46      cycles for Small 3.1
43      cycles for Small 4
99      cycles for C version

--- ok ---

Axhex2dw1 (the "Small 1") is 69 bytes long, Axhex2dw2 (the "Small 2") is 48 bytes long.

OK, now the test with maximum performance optimization for C code.

The disassembly:

Code Select


_Axhex2dw_C@4:
  00000000: 56                 push        esi
  00000001: 8B 74 24 08        mov         esi,dword ptr [esp+8]
  00000005: 8A 0E              mov         cl,byte ptr [esi]
  00000007: 33 C0              xor         eax,eax
  00000009: 84 C9              test        cl,cl
  0000000B: 74 20              je          0000002D
  0000000D: 8D 49 00           lea         ecx,[ecx]
  00000010: 0F BE C9           movsx       ecx,cl
  00000013: 8B D1              mov         edx,ecx
  00000015: C1 FA 06           sar         edx,6
  00000018: 83 E1 0F           and         ecx,0Fh
  0000001B: 8D 14 D2           lea         edx,[edx+edx*8]
  0000001E: 03 D1              add         edx,ecx
  00000020: 8A 4E 01           mov         cl,byte ptr [esi+1]
  00000023: 46                 inc         esi
  00000024: C1 E0 04           shl         eax,4
  00000027: 03 C2              add         eax,edx
  00000029: 84 C9              test        cl,cl
  0000002B: 75 E3              jne         00000010
  0000002D: 5E                 pop         esi
  0000002E: C2 04 00           ret         4

49 bytes long. You may see that the compiler used LEA to multiply EDX by 9 - the algo in C was intentionally written in such a way that optimizing compiler will produce "at first look" the code similar to the handwritten code, i.e. it was written speed-optimized already in HLL, and you can see that inner loop logic is the same as in the handwritten code, so, this time the timings for the C code should probably be very good, but...

Code Select


Intel(R) Celeron(R) CPU 2.13GHz (SSE3)

46      cycles for Small 1
45      cycles for Small 2
43      cycles for Small 3
47      cycles for Small 3.1
51      cycles for Small 4
79      cycles for C version
43      cycles for Small 1
44      cycles for Small 2
43      cycles for Small 3
50      cycles for Small 3.1
43      cycles for Small 4
86      cycles for C version
43      cycles for Small 1
45      cycles for Small 2
43      cycles for Small 3
46      cycles for Small 3.1
43      cycles for Small 4
79      cycles for C version

--- ok ---

The logic is the same (of course, algo is the same), but the implementation is different. So, like I said in my first message in the thread, sophisticated (even this one, that's not really too sophisticated) algos aren't the best things that even optimizing compiler can produce well. Also some things like grabbing a char, then promoting it to dword with two instructions instead of one, then the strange compiler's fear to get the same byte twice from a memory reference, etc - these things make algo slower. Well, being the program, the compiler "writes" very good code, we should agree with that

It's really big and hard people's work behind the short description "Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.30319.01 for 80x86"

It will be very interesting to see the results on different machines, as this time this is also the test "for which machines does Microsoft optimize their compilers?" - i.e. on which machines VC's code performs better - so we can say "on this machine Windows (Word/Photoshop/etc) works faster than on that (with equal CPU freq)! Just because it uses MSVC" :lol:

hutch-- · May 06, 2013, 01:43:35 PM

You don't have to be a genius to know that this forum is DIFFERENT to the last one that was hosted in the UK. After a lot of work I have made an archived version of the old forum available but everybody who is a member of this forum started with a zero post count, me included as the administrator. I did not write the software and nor do I care who likes it or not, it does the job and it is not going to be modified to suit a quirk of that few folks who have not done the work to build a new forum and archive the old one.

The advice has already been given by more active members, make some more posts.

jj2007 · May 06, 2013, 02:44:17 PM

Quote from: Antariy on May 06, 2013, 01:23:07 PM
It will be very interesting to see the results on different machines

Hi Alex,
On paper we have the same machine but results are different:

Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
41 cycles for Small 1
38 cycles for Small 2
41 cycles for Small 3
41 cycles for Small 3.1
41 cycles for Small 4
75 cycles for C version
41 cycles for Small 1
49 cycles for Small 2
41 cycles for Small 3
41 cycles for Small 3.1
41 cycles for Small 4
50 cycles for C version
41 cycles for Small 1
38 cycles for Small 2
53 cycles for Small 3
41 cycles for Small 3.1
41 cycles for Small 4
51 cycles for C version

The 75 cycles peak is not an accident, it's there for every run I tried.

AMD Athlon(tm) Dual Core Processor 4450B (SSE3)

39 cycles for Small 1
42 cycles for Small 2
39 cycles for Small 3
39 cycles for Small 3.1
39 cycles for Small 4
44 cycles for C version

habran · May 06, 2013, 02:55:35 PM

Code Select


Intel(R) Core(TM) i7-3610QM CPU @ 2.30GHz (SSE4)

15      cycles for Small 1
16      cycles for Small 2
16      cycles for Small 3
39      cycles for Small 3.1
14      cycles for Small 4
26      cycles for C version
17      cycles for Small 1
16      cycles for Small 2
16      cycles for Small 3
16      cycles for Small 3.1
19      cycles for Small 4
26      cycles for C version
16      cycles for Small 1
17      cycles for Small 2
18      cycles for Small 3
19      cycles for Small 3.1
18      cycles for Small 4
26      cycles for C version

--- ok ---

sinsi · May 06, 2013, 03:12:23 PM

I wonder how much influence the OS has in these timings too, like whether running 32-bit code on a 64-bit OS has a penalty?

Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz (SSE4)

22 cycles for Small 1
23 cycles for Small 2
24 cycles for Small 3
24 cycles for Small 3.1
24 cycles for Small 4
28 cycles for C version
23 cycles for Small 1
24 cycles for Small 2
20 cycles for Small 3
19 cycles for Small 3.1
19 cycles for Small 4
34 cycles for C version
20 cycles for Small 1
20 cycles for Small 2
20 cycles for Small 3
20 cycles for Small 3.1
20 cycles for Small 4
33 cycles for C version

Antariy · May 06, 2013, 03:13:13 PM

Hi, Jochen

I think it's maybe because of first-time pass - code cache did not probably purge and in next tests at the same run your CPU has already ready "food". And it seem that at first time the code with superfluous instructions and not best arranged logic takes much longer to be decoded and executed, than next time it just executed from the code cache. But, speaking of "real life"(tm) app, this means that this code is crazly unoptimal (since in real app it will not be called millions of times consequently, so, the best code is that code which also gets decoded faster... but, again, what means +/- 30 clocks if the proc will be called, let's say, once in a second... that's the true reason why compilers are so popular for "general programming"... but, but, again, the entire proggie consists from these small bits and if every of it will be a bit faster, entire proggie will be faster... well, philosophy is starting here

)

It's interesting also how different tweaks of the assembly versions of Axhex2dw perform. "Small 1" (you remember it, of course

) still looks like more or less stable in the timings. And, even being longer than C version in therms of code size, it still faster.

Hi, habran, thanks for testing it :t

Hi, John, thank you, too

As for influence of the OS - I think you're right, and switching to a 32 bit execution context has a penalty under 64 bit OS, like it is for 16 bit apps under 32 bit OS (but since my x64 experience is small - I cannot say it 100%).

hutch-- · May 06, 2013, 05:25:12 PM

There is an old rule that occasionally applies, if you get a good enough algorithm, then the coding does not matter as much. I have an example in mind, a hybrid sort of Robert Sedgewick originally written in C and it was genuinely fast. I used it as an algorithm to test out a tool I was designing and converted it directly to unoptimised assembler, removed the stack frame, dropped the instruction count, inlined all of the satellite functions and it would not go faster than the C original.

I got a lower better optimised instruction sequence and it stubbornly refused to go faster. It does not happen all that often but it was interesting to see. Basically a perfect algorithm that was highly insensitive to coding technique.

The MASM Forum

News:

C/C++ vs Assembler

hutch--

Gunther

Manos

hutch--

jj2007

Gunther

Manos

Gunther

Antariy

hutch--

jj2007

habran

sinsi

Antariy

hutch--