Title: C/C++ vs Assembler
Post by: Manos on May 04, 2013, 04:11:50 AM

Hi all.

It is true that code written in assembly is faster than code
written in any high-level language.
Some time ago, I read in this forum that some code
was faster when written in C/C++ than in assembly.
Why?
The answer is below.

Recently I read the following on the Web:

..........................
I wrote this function in C++, assembly (in-line) and assembly (MASM).

Here is the C++ Code:

char cppToUpper(char c)
{
    if (c > 122 || c < 97)
        return c;
    else
        return c - 32;
}

Here is the inline assembly Code:

char cToUpper(int c)
{
//
//cout << cLowerLimit;
_asm
{
//Copy the character onto the arithmetic register for single bytes
mov eax, c;
//Test the Upper Limit
cmp eax, 122; // Compare the Character to 122
ja End; // Jump to the end if above -- the character is too high to be a lower case letter
//Test the lower limit
cmp eax, 97 //Compare the character to 97
jb End; // Jump to the end if below == the character is too low to be a lower case letter
//Now the operation begins
sub eax, 32; //Subtract 32 from the character in the register
End:
// mov result, al; //Move the Character in the register into the result variable
}
}

And here is the function in pure assembly language:

.686
.model flat, stdcall
option casemap :none
.code
cUpperCase2 proc cValue:DWORD
mov eax, cValue
cmp eax, 122
ja TEnd
cmp eax, 97
jb TEnd
sub eax, 32
TEnd:
ret
cUpperCase2 endp
end

Now, here is what the C++ function disassembles to:

char cppToUpper(char c)
{
01271680 push ebp
01271681 mov ebp,esp
01271683 sub esp,0C0h
01271689 push ebx
0127168A push esi
0127168B push edi
0127168C lea edi,[ebp-0C0h]
01271692 mov ecx,30h
01271697 mov eax,0CCCCCCCCh
0127169C rep stos dword ptr es:[edi]
if (c > 122 || c < 97 )
0127169E movsx eax,byte ptr [c]
012716A2 cmp eax,7Ah
012716A5 jg cppToUpper+30h (12716B0h)
012716A7 movsx eax,byte ptr [c]
012716AB cmp eax,61h
012716AE jge cppToUpper+37h (12716B7h)
return c;
012716B0 mov al,byte ptr [c]
012716B3 jmp cppToUpper+3Eh (12716BEh)
012716B5 jmp cppToUpper+3Eh (12716BEh)
else return c - 32;
012716B7 movsx eax,byte ptr [c]
012716BB sub eax,20h
}
012716BE pop edi
012716BF pop esi
012716C0 pop ebx
012716C1 mov esp,ebp
012716C3 pop ebp
012716C4 ret

ALRIGHT HERE'S the question. Why is the C++ code considerably faster even though it compiles to far more instructions than my assembly language code uses? 48 "ticks" expire when executing the pure assembly language function 10,000,000 times (I'll put this stuff at the very bottom); 0 ticks when executing it in C++, and 16 when using inline assembly?

I am impressed that I was even able to get it to work in assembly but perplexed at the performance results. I'll put the main() function below along with the efficiency timing stuff for your reference.
Any ideas? I am just trying to learn a little assembly because I am curious about how computers actually work.

#include "stdafx.h"
#include <iostream>
#include <string>
#include "windows.h"
#include "time.h"
using namespace std;
extern "C" int _stdcall cUpperCase2(char c);
class stopwatch
{
public:
stopwatch() : start(clock()){} //start counting time
~stopwatch();
private:
clock_t start;
};
stopwatch::~stopwatch()
{
clock_t total = clock()-start; //get elapsed time
cout<<"total of ticks for this activity: "<<total<<endl;
cout <<"in seconds: "<< double(total)/CLK_TCK <<endl; // convert before dividing, otherwise integer division truncates
}
void main()
{
bool bAgain = true;
while (bAgain)
{
// unsigned long lTimeNow = t_time;
char c = 'a';
char d = '!';
char e;
//cout << "A lowercase character will be converted to Uppercase:" << endl;
//cin >> c;
{
stopwatch watch;
for (int i=0; i < 10000000; i++)
{
e= cUpperCase2(c);
e= cUpperCase2(d);
}
}
cout << "That was the external function written in assembler." << endl;
{
stopwatch watch;
for (int i=0; i < 10000000; i++)
{
// cout << cppToUpper(c);
//cout << cppToUpper(d);
e= cppToUpper(c);
e= cppToUpper(d);
}
}
cout << "That was C++\n";
{
stopwatch watch;
for (int i=0; i < 10000000; i++)
{
e= cToUpper(c);
e= cToUpper(d);
}
}
cout << "That was in line assembler\n";
cout << "Enter a letter and hit enter to exit (a will repeat) . . ." << endl;
cin >> c;
//return 0;
if (c != 'a')
bAgain = false;
}
}

My answer is that the C/C++ compiler never executes the above loop when it calls the cppToUpper function.
This is because the C/C++ compiler knows the result at compile time and emits the result without executing the loop.
This is called optimization.
But if, in the above loop, you put the following below the call e= cToUpper(d);
if(i > 10000000)
break;
the C/C++ compiler will execute the loop and the result is different.
The conclusion is that sometimes C/C++ is faster.

Manos.

Title: Re: C/C++ vs Assembler
Post by: qWord on May 04, 2013, 04:47:38 AM

For fairness, please use a release build, turn on all optimizations, and use a realistic function/algorithm with real-world data. If the function's input depends on some runtime input (e.g. command line or user input), the compiler can't remove the function as it did in your test. Also note that the compiler may inline your code.
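
A minimal sketch of the point about runtime input, assuming the cppToUpper logic from the first post. The helper name run_test, the volatile read and the checksum are illustrative assumptions and do not come from the thread. The volatile read stands in for "input the compiler cannot know", and returning a checksum gives the result a use so that the calls are not dead code.

```cpp
#include <cassert>

// Same logic as cppToUpper in the first post.
char cppToUpper(char c)
{
    if (c > 122 || c < 97)
        return c;
    return static_cast<char>(c - 32);
}

// Hypothetical helper: calls the function `loops` times on an input that is
// re-read through a volatile object on every iteration, and sums the results.
unsigned long run_test(char seed, int loops)
{
    volatile char input = seed;      // value is opaque to the optimiser
    unsigned long checksum = 0;
    for (int i = 0; i < loops; ++i)
        checksum += static_cast<unsigned char>(cppToUpper(input));
    return checksum;                 // caller should print or otherwise use this
}
```

In a real test the seed would come from the command line or from user input, and the returned checksum would be printed after the timed region.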

Title: Re: C/C++ vs Assembler
Post by: jj2007 on May 04, 2013, 04:56:08 AM

Quote from: Manos on May 04, 2013, 04:11:50 AM
Recently I read the following on the Web:

In the DaniWeb (http://www.daniweb.com/software-development/assembly/threads/155136/assembly-vs.-c-performance)? Rarely seen so many confused people in one thread :P

When comparing reasonable C++ and assembly code, they are often equally fast.

If the C++ code is more than 5% faster, go and check if it does not eliminate some steps by guessing that two constants can be condensed into one (compilers can be clever 8))
No problem - do the same in assembler, and you are back on par.
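
As a hedged illustration of how a compiler can condense constants: in the sketch below the same range check is written as a constexpr function (the name to_upper_ascii is an assumption, not from the thread). The static_assert lines show that calls with literal arguments are compile-time constants. Whether a given compiler then removes a whole loop of such calls depends on the compiler and its optimisation settings.

```cpp
#include <cassert>

// Same range check as cppToUpper, written so the compiler may evaluate it.
constexpr char to_upper_ascii(char c)
{
    return (c > 122 || c < 97) ? c : static_cast<char>(c - 32);
}

// With literal arguments the results are known at compile time.
static_assert(to_upper_ascii('a') == 'A', "evaluated by the compiler");
static_assert(to_upper_ascii('!') == '!', "evaluated by the compiler");

// A loop shaped like the one in the first post. Every iteration stores the
// same two constants, so an optimiser is free to keep only the final value.
char last_result_of_constant_loop()
{
    char e = 0;
    for (int i = 0; i < 10000000; ++i) {
        e = to_upper_ascii('a');
        e = to_upper_ascii('!');
    }
    return e;
}
```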

If, on the other hand, you believe the C++ code is not fast enough, then disassemble the innermost loop and trim it with hand-made assembly. Depending on the quality of the initial code, improvements between 10% and a factor 10 are always possible. Search the Laboratory for the word timings to see what's feasible. Many good algos have been written "against" the C Runtime Library, which is probably one of the libraries that have been "beaten to death" by M$ programmers to tickle the last bit of performance out of Windows. And voilà, assembly is still often a factor 2 or 3 faster. Ask Lingo (http://www.masmforum.com/board/index.php?topic=13701.msg107596#msg107596) if you can find him ;)

Title: Re: C/C++ vs Assembler
Post by: Adamanteus on May 04, 2013, 06:26:42 AM

The premise of this topic is not entirely correct, because what you are really discussing is C code, shown with its prologue and epilogue and so on, which assembly discards. The correct view is that assembly is more efficient than C, as experience proves, but C++ compilers often have a higher level of optimisation, and the underscore before the asm keyword shows that the results depend on the system, not on the language.

Title: Re: C/C++ vs Assembler
Post by: anunitu on May 04, 2013, 07:14:03 AM

Hoping this isn't going to turn into a flame war about different languages, seen way too many of those. :dazzled:

Title: Re: C/C++ vs Assembler
Post by: hutch-- on May 04, 2013, 09:11:05 AM

You tend to do comparisons like this by writing the task in both languages and comparing them in a benchmark, combining the two formats in a compiler does justice to neither, the inline assembler messes up the compiler optimisation and the C compiler generally uses registers in non-standard ways which increases the overhead of calling an inline assembler routine.

Microsoft have long had the solution, write your C/C++ in a C compiler and write your assembler code with an assembler, then LINK them together and you get the best of both worlds, not the worst.

The next factor is it's easy to write both lousy C and lousy assembler, if you are going to make comparisons you need to benchmark both to locally optimise each one. THEN do the comparison.

Title: Re: C/C++ vs Assembler
Post by: dedndave on May 04, 2013, 01:01:11 PM

i suspect that, for some rare cases, you can implement some things faster in assembler
also, in C, some things are easier, like COM and .NET, etc - and C is more maintainable
otherwise - it's a good design/bad design thing, as Hutch says

use the right tool for the job

now, in my case, i am not very proficient in C
and - i don't write code for a living, either
i prefer assembler and write in assembler

Title: Re: C/C++ vs Assembler
Post by: MichaelW on May 04, 2013, 03:36:45 PM

This code eliminates the C++ stuff, adds some function test code (currently commented out), adds a test of the CRT function, adds a naked function, and removes the prologue and epilogue from the external assembly procedure.

//=============================================================================
#include <windows.h>
#include <conio.h>
#include <stdio.h>
#include <stdlib.h>   // rand (used by the commented-out test code)
#include <ctype.h>    // toupper
#include "counter_c.c"
//=============================================================================
// These for safety on single-core systems.
#define PP HIGH_PRIORITY_CLASS
#define TP THREAD_PRIORITY_NORMAL
// These for multi-core systems.
//#define PP REALTIME_PRIORITY_CLASS
//#define TP THREAD_PRIORITY_TIME_CRITICAL

#define LOOPS 10000000
//=============================================================================

int c_toupper(int c)
{
  if (c > 122 || c < 97)
    return c;
  else
    return c - 32;
}

char ia_toupper(int c)
{
  __asm
  {
    mov eax, c
    cmp eax, 122
    ja  end
    cmp eax, 97
    jb  end
    sub eax, 32
  end:
  }
}

__declspec(naked) int nk_toupper(int c)
{
  __asm
  {
    mov eax, [esp+4]
    cmp eax, 122
    ja  end
    cmp eax, 97
    jb  end
    sub eax, 32
  end:
    ret
  }
}

//----------------------------------------------------------------------------
// This is necessary to prevent the optimizer from breaking the counter code.
//----------------------------------------------------------------------------

#pragma optimize("",off)

int asm_toupper(int c);

void main(void)
{
  int i, c;
  /*
  for(i=0;i<200;i++)
  {
    c = rand() >> 8;
    printf("%c",toupper(c));
    printf("%c",c_toupper(c));
    printf("%c",ia_toupper(c));
    printf("%c",nk_toupper(c));
    printf("%c",asm_toupper(c));
    printf("\n");
  }
  */

  SetProcessAffinityMask(GetCurrentProcess(),1);

  Sleep(5000);

  for(i=0;i<4;i++)
  {
    counter_begin(1,LOOPS,PP,TP);
    counter_end(1)
    printf( "%d cycles, empty\n", counter_cycles );
    counter_begin(2,LOOPS,PP,TP);
      c = toupper(95);
      c = toupper(110);
      c = toupper(125);
    counter_end(2)
    printf( "%d cycles, toupper\n", counter_cycles );
    counter_begin(3,LOOPS,PP,TP);
      c = c_toupper(95);
      c = c_toupper(110);
      c = c_toupper(125);
    counter_end(3)
    printf( "%d cycles, c_toupper\n", counter_cycles );
    counter_begin(4,LOOPS,PP,TP);
      c = ia_toupper(95);
      c = ia_toupper(110);
      c = ia_toupper(125);
    counter_end(4)
    printf( "%d cycles, ia_toupper\n", counter_cycles );
    counter_begin(5,LOOPS,PP,TP);
      c = nk_toupper(95);
      c = nk_toupper(110);
      c = nk_toupper(125);
    counter_end(5)
    printf( "%d cycles, nk_toupper\n", counter_cycles );
    counter_begin(6,LOOPS,PP,TP);
      c = asm_toupper(95);
      c = asm_toupper(110);
      c = asm_toupper(125);
    counter_end(6)
    printf( "%d cycles, asm_toupper\n\n", counter_cycles );
  }
  getch();
}

#pragma optimize("",on)

Results on my P3:

0 cycles, empty
49 cycles, toupper
20 cycles, c_toupper
26 cycles, ia_toupper
20 cycles, nk_toupper
20 cycles, asm_toupper

0 cycles, empty
49 cycles, toupper
20 cycles, c_toupper
27 cycles, ia_toupper
20 cycles, nk_toupper
20 cycles, asm_toupper

0 cycles, empty
49 cycles, toupper
20 cycles, c_toupper
27 cycles, ia_toupper
20 cycles, nk_toupper
20 cycles, asm_toupper

0 cycles, empty
49 cycles, toupper
20 cycles, c_toupper
27 cycles, ia_toupper
20 cycles, nk_toupper
20 cycles, asm_toupper
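
The counter_begin/counter_end macros come from counter_c.c, which is not shown in the thread. The "0 cycles, empty" lines suggest that the overhead of an empty loop is subtracted from each measurement. A rough sketch of that general pattern follows, with loud assumptions: it uses std::chrono instead of a cycle counter, so it reports nanoseconds rather than cycles, and the names time_loop_ns, net_ns_per_iteration and g_sink are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>

// Same logic as c_toupper in the post above.
int c_toupper(int c) { return (c > 122 || c < 97) ? c : c - 32; }

volatile int g_sink;   // volatile store keeps every iteration observable

// Runs body(i) `loops` times and returns the elapsed time in nanoseconds.
template <typename Body>
double time_loop_ns(int loops, Body body)
{
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < loops; ++i)
        body(i);
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::nano>(t1 - t0).count();
}

// Times an "empty" loop and a loop that calls the function, keeps the best
// of `repeats` runs for each, and returns the difference per iteration.
double net_ns_per_iteration(int loops, int repeats)
{
    double best_empty = 1e300;
    double best_full = 1e300;
    for (int r = 0; r < repeats; ++r) {
        best_empty = std::min(best_empty,
            time_loop_ns(loops, [](int i) { g_sink = i; }));
        best_full = std::min(best_full,
            time_loop_ns(loops, [](int i) { g_sink = c_toupper(95 + (i & 31)); }));
    }
    // Timer noise can make the difference slightly negative; clamp at zero.
    return std::max(0.0, (best_full - best_empty) / loops);
}
```

Taking the minimum over several runs is one common way to reduce the influence of interrupts and task switches. It is a design choice of this sketch, not something the thread states about counter_c.c.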

Title: Re: C/C++ vs Assembler
Post by: hutch-- on May 04, 2013, 04:33:17 PM

This is the result on my Core2 quad. (3 gig)

0 cycles, empty
31 cycles, toupper
52 cycles, c_toupper
63 cycles, ia_toupper
39 cycles, nk_toupper
30 cycles, asm_toupper

0 cycles, empty
31 cycles, toupper
44 cycles, c_toupper
63 cycles, ia_toupper
32 cycles, nk_toupper
31 cycles, asm_toupper

0 cycles, empty
31 cycles, toupper
48 cycles, c_toupper
63 cycles, ia_toupper
41 cycles, nk_toupper
29 cycles, asm_toupper

0 cycles, empty
31 cycles, toupper
57 cycles, c_toupper
61 cycles, ia_toupper
41 cycles, nk_toupper
30 cycles, asm_toupper

Title: Re: C/C++ vs Assembler
Post by: Manos on May 04, 2013, 06:27:15 PM

Some people have not understood the spirit of my words.

Assembly is the best for small programs and for writing APIs, libraries and drivers.
But if you attempt to write a big program like my IDE, you will spend a ton of time.
And a question: Why MASM, JWASM, POASM, NASM and Windows are written in C language ?

In my MSDN for VStudio 6.0, Microsoft writes:
C Language Reference
The C language is a general-purpose programming language known for its efficiency, economy, and portability. While these characteristics make it a good choice for almost any kind of programming, C has proven especially useful in systems programming because it facilitates writing fast, compact programs that are readily adaptable to other systems. Well-written C programs are often as fast as assembly-language programs, and they are typically easier for programmers to read and maintain.

And a little words to jj2007
.
You have no the right to call Microsoft as M$.
If you don't like Microsoft, turn to Linux.

Manos.

Title: Re: C/C++ vs Assembler
Post by: Gunther on May 04, 2013, 07:30:26 PM

Hi Manos,

Quote from: Manos on May 04, 2013, 06:27:15 PM
In my MSDN for VStudio 6.0, Microsoft writes:
C Language Reference
The C language is a general-purpose programming language known for its efficiency, economy, and portability. While these characteristics make it a good choice for almost any kind of programming, C has proven especially useful in systems programming because it facilitates writing fast, compact programs that are readily adaptable to other systems. Well-written C programs are often as fast as assembly-language programs, and they are typically easier for programmers to read and maintain.

and is it right, because Microsoft wrote it? Do you really believe that?

Quote from: Manos on May 04, 2013, 06:27:15 PM
And a question: Why MASM, JWASM, POASM, NASM and Windows are written in C language ?

The answer is easy: they are written in C for better maintenance, but that has nothing to do with performance. There are on the other hand good assemblers (FASM, SolAsm) which are written in assembly language. Furthermore, there are compilers written in assembly language, too. You should, for example, have a look into that thread (http://masm32.com/board/index.php?topic=964.0), especially reply #6.

Gunther

Title: Re: C/C++ vs Assembler
Post by: qWord on May 04, 2013, 09:48:56 PM

Quote from: Gunther on May 04, 2013, 07:30:26 PM
Quote from: Manos on May 04, 2013, 06:27:15 PM
In my MSDN for VStudio 6.0, Microsoft writes:
C Language Reference
The C language is a general-purpose programming language known for its efficiency, economy, and portability. While these characteristics make it a good choice for almost any kind of programming, C has proven especially useful in systems programming because it facilitates writing fast, compact programs that are readily adaptable to other systems. Well-written C programs are often as fast as assembly-language programs, and they are typically easier for programmers to read and maintain.

and is it right, because Microsoft wrote it? Do you really believe that?

So, it's wrong because MS wrote it?
I would accept it with a few modifications:

Quote
The C language is a general-purpose programming language known for its efficiency^(?), ~~economy~~, and portability. ~~While these characteristics make it a good choice for almost any kind of programming,~~ C has proven especially useful in systems programming because it facilitates writing fast, compact programs that are readily adaptable to other systems. Well-written C programs are often as fast as assembly-language programs, and they are typically easier for programmers to read and maintain.

Quote from: Manos on May 04, 2013, 06:27:15 PM
And a little words to jj2007
.
You have no the right to call Microsoft as M$.

of course he has the right!

Title: Re: C/C++ vs Assembler
Post by: dedndave on May 04, 2013, 10:34:25 PM

the assembler program is probably smaller than the hll one :P

Title: Re: C/C++ vs Assembler
Post by: Gunther on May 04, 2013, 10:34:36 PM

Hi qWord,

Quote from: qWord on May 04, 2013, 09:48:56 PM
So, it's wrong because MS wrote it?

I would say: yes, because you made a lot of changes to the original statement to accept it, didn't you? 8)

Quote from: qWord on May 04, 2013, 09:48:56 PM
of course he has the right!

Without any doubt. :t

Gunther

Title: Re: C/C++ vs Assembler
Post by: anunitu on May 04, 2013, 11:27:21 PM

You do know Microsoft is not a religion right?..really one could say anything about MS and not be burned alive on a stack of motherboards.

Title: Re: C/C++ vs Assembler
Post by: MichaelW on May 05, 2013, 12:29:40 AM

Quote from: dedndave on May 04, 2013, 10:34:25 PM
the assembler program is probably smaller than the hll one :P

And if the C program was done in plain, straightforward C it was probably easier and faster to code and less likely to contain bugs/errors.

Title: Re: C/C++ vs Assembler
Post by: Manos on May 05, 2013, 12:46:35 AM

Quote from: anunitu on May 04, 2013, 11:27:21 PM
You do know Microsoft is not a religion right?..really one could say anything about MS and not be burned alive on a stack of motherboards.

A serious person never speaks ironically about other persons, companies or works.
If someone does not like a company or a project, they should simply not use it.
But some people sometimes behave like juniors in high school.
This forum is supposed to be for talking about programming, not for speaking ironically or for attacks.

Manos.

Title: Re: C/C++ vs Assembler
Post by: hutch-- on May 05, 2013, 12:47:20 AM

It's basically the case that what you are familiar with will in part dictate what you program in. The last rewrite of my editor was completely in MASM and I saved more than 2 weeks in writing time over the earlier version. It is just on 200k of assembler code for the bare editor, then there is the code for the dedicated DLLs which is a mix of HLL and assembler, the scripting engine is almost exclusively assembler using the dynamic string of basic for technical reasons.

That none of it is all that big is because it IS written in assembler and much of the speed gain in writing better code was because it was written without the irritations of higher level languages. It's typically the case of being front end unfriendly but very back end friendly.

The main gain writing in C is portability but only if you are using fully portable libraries with no API code. Once you are committed to a specific operating system with Windows API functions, MASM code is just as fast to write as any higher level language that uses the same functions and it is free of the irritations and restrictions of higher level languages.

VC98 is antique junk and the assumptions from that era are no longer sound, even if they were back then, the odd bits of C that I compile these days are usually in VC2003 as it was both a better compiler and better linker with a greater range of libraries.

Title: Re: C/C++ vs Assembler
Post by: Antariy on May 05, 2013, 01:35:06 AM

First link in the signature field of this message.
Pure hardcore C++ written to be compilable for both 32 bit and 64 bit platforms without change of single line of code.
The question of the EXE file size in this case is a question of laziness, and, yes, written in assembly with the same functionality it will not be much smaller anyhow. Aligned by section size, the EXE will not be smaller at all.

But yes, it is not the anyhow complex program, and, what is more important, it uses pretty linear logic which uses integer arithmetic.

There, in the field of the code which is easy "serializable", like simple integer math, control flow structures, calltables, jumptables and other similar stuff are, which do have equal translation from the HLL to the machine code, the C optimizing compiler-generated code is not slower and is not "bigger" than the best possible ASM code. But, still, FPU code, tricky and/or sophisticated integer code, SSE code and so on - is the kingdom of ASM.

Sometimes the question of "what language to use: C or ASM" is just a matter of design, development time for the given particular project and of course laziness :greensml:

Jochen, :t

Title: Re: C/C++ vs Assembler
Post by: anta40 on May 05, 2013, 01:42:52 AM

Quote from: Manos on May 04, 2013, 06:27:15 PM
And a question: Why MASM, JWASM, POASM, NASM and Windows are written in C language ?

Probably this has something to do with the UNIX culture.
In the 1970s, those Bell Labs folks (Dennis Ritchie, Ken Thompson, et al.) needed a higher-level language than asm to rewrite UNIX. Hence, C was born.
They used C not only for the UNIX kernel, but also for the compiler, assembler, editor, etc.
And since these UNIX systems had a C compiler included, C became widespread. Everybody started to write in C (http://www.youtube.com/watch?v=1S1fISh-pag).

Now why are those things written in C?

C is well known
C is portable
C provides a little higher abstraction than asm
C can be compiled into efficient machine code (given a smart compiler)
C is not that complicated (compared to C++, for example)

I think C is fine for doing system programming. I don't buy the old UNIX hackers mindset though, that is to write everything in C. If people asked me to write applications, I will choose languages "higher" than C, like Delphi or Java, for example.

Title: Re: C/C++ vs Assembler
Post by: Jibz on May 05, 2013, 07:00:18 AM

I've always found it odd to try to compare a HLL to assembly language -- since any HLL, interpreted or compiled, will in the end result in execution of assembly language instructions, it is obvious that no HLL can ever be faster than optimal assembly language.

What this fails to take into account, is of course that it requires considerable knowledge, skill, and usually time, to get even a near-optimal solution in assembly language. Also, once you have reached a near-optimal solution, even small changes in the specification, can result in large amounts of work to change the code to solve the new problem in a near-optimal way.

There are few situations where the extra effort is worth it; usually inner loops in time critical code. Elsewhere, I believe spending a fraction of the time writing, debugging, and maintaining the code in a HLL makes more sense :t.

Title: Re: C/C++ vs Assembler
Post by: jj2007 on May 05, 2013, 08:58:54 AM

Quote from: Jibz on May 05, 2013, 07:00:18 AM
it requires considerable knowledge, skill, and usually time, to get even a near-optimal solution in assembly language.

Yes it requires an effort, once, to get a library function working, but so what? Over years, we had a lot of fun picking a CRT function, e.g. strcmp, and giving it a thorough beating, see here (http://masm32.com/board/index.php?topic=1167.msg11925#msg11925). And why is a factor five faster than CRT only "near-optimal"?

Now that may be an extreme example, but over time we have seen many CRT functions being replaced by assembler equivalents that did the job in less than half the time.

So much for speed and innermost loops. But what about a real life app? Does it really take longer to write invoke CreateWindowEx instead of CreateWindowEx();, with three extra chars that will ruin your fingers over the years?

Sure it will take more time if you mean "pure" assembler - Hutch maintains a nice demo at \Masm32\examples\exampl07\slickhuh

But only masochists use pure assembler, reasonable coders use macros mixed with pure assembler. For example, loading a text file from disk and shuffling it into a string array costs me one line. One.

Handling arrays, whether "double" aka REAL8 or REAL4 or any integer size, can be easier with assembly macros than with most HLLs including C/C++

qWord has written a marvelous simple math library (http://sourceforge.net/projects/smplmath/) that makes arithmetics as easy as in any other HLL.

We have rejecting and non-rejecting loops, C and Basic-style for loops.
We have a Switch/Case macro, and it has a lot more power (http://masm32.com/board/index.php?topic=1185.0) than its C equivalent.
We have several macros for SEH, e.g. mine (http://masm32.com/board/index.php?topic=185.0) - but almost nobody uses them. Because we control our code so tightly that we would not allow a release version throwing exceptions.

Which brings me to my last point: Quality. HLL must be better, right? That's theory, practice is that assembler forces the coder to reflect thoroughly on each and every line, and that produces better quality than code that relies on "my compiler knows how to do that".

Of course, C/C++ is everywhere, and most of the big commercial apps are written in that language. That is why I am not impressed, I see too much bullshit produced from apparently huge teams of programmers in ~~respected~~ big software companies. MS Word forgetting to redraw, for example. And every time I shut my puter down instead of hibernating it, on reboot Adobe pops up and solemnly swears that this time the known security and performance problems are finally solved. IMHO they will be solved when Adobe goes back to BASIC.

[edit: fixed a garbled phrase]

Title: Re: C/C++ vs Assembler
Post by: hutch-- on May 05, 2013, 10:20:27 AM

I am a moderate here, I am of the view that Microsoft have it right in providing BOTH CL.EXE and ML.EXE with a compatible object module format so that you can link them both together so you can get the best (or worst) of both worlds in an application. What I am not a fan of is inline assembler in an optimising compiler as it makes a mess of both capacities, to this extent one of the few things I approve of in 64 bit code is the removal of inline assembler so that you MUST write a separate module for assembler code.

Now having cast my eye over a lot of C and assembler code in my time, one thing you CAN guarantee is that badly written code in either performs badly, equally well written code in either performs well enough in most instances and an appropriate level of competence in both is necessary to perform any sensible comparison.

Then there is the readability issue and I have seen enough of both to say that much of both is unreadable, with bad assembler it looks something like an assembler dump and while I am practiced at reading this stuff, it is no joy to work on. With badly written C I have seen some of the most appalling messes with the author trying to be profound in bundling a whole heap of functions together and non-symmetrical brace indentation and to read or fix it you have to carefully pick it apart and re-edit it to make sense of it.

My own C code is still written 1990 K&R style so I can read it but I write very little C these days, that's what I have MASM for. In a forum of this type where there are a large number of people literate in multiple languages, a topic of this type is little better than noise as most members know the difference and use the tool of their choice for the task they have in mind.

For folks who are happy to use mixed languages there are viable options using both.

C + asm
asm + C
asm
C

It's easy enough to write a separate module in C and then link it into a MASM app as well as the other way around. I remember some years ago finding this massive collection of sort algos written in C so I compiled each one into a module and tested them in a MASM test piece. The few that were any good I converted directly to MASM, mainly for the practice of optimising the C code to get a bit more pace out of it.

Title: Re: C/C++ vs Assembler
Post by: Antariy on May 05, 2013, 12:43:24 PM

I will second Hutch's proposal.

(a bit kidding now)

try {
Jochen, I agree with all sentences of your speech,
}
except {

Quote from: jj2007 on May 05, 2013, 08:58:54 AM
We have several macros for SEH, e.g. mine (http://masm32.com/board/index.php?topic=185.0) - but almost nobody uses them. Because we control our code so tightly that we would not allow a release version throwing exceptions.

}

I'll say that it mostly means that "almost nobody" writes the code which really needs these things nor bothers with any other exception handling than showing own message about exception and exiting, avoiding final "Send ~~all your data~~ crash report to the ~~moon~~ you know where" OS' dialog.
Really, there are many tasks which may require handling of planned exceptions, and that would not be a flaw in the program design nor a flaw of the programmer's competency; it will instead be part of the design, the right feature.
As a simple example - catching an In-page I/O exception (C0000006H) when one is reading the huge sparse memory-mapped file from the disk where there is no much free space to hold that file if it would have been not sparse.

Title: Re: C/C++ vs Assembler
Post by: jj2007 on May 05, 2013, 05:33:44 PM

Quote from: Antariy on May 05, 2013, 12:43:24 PM
Quote from: jj2007 on May 05, 2013, 08:58:54 AM
We have several macros for SEH, e.g. mine (http://masm32.com/board/index.php?topic=185.0) - but almost nobody uses them. Because we control our code so tightly that we would not allow a release version throwing exceptions.

As a simple example - catching an In-page I/O exception (C0000006H) when one is reading the huge sparse memory-mapped file from the disk where there is no much free space to hold that file if it would have been not sparse.

Alex,

You are right, of course SEH has its legit uses. Thanks for the example.

But in 99% of all cases you can live without SEH, or limit it to a subproc. The OS calls them exceptions because they should be exceptional events.

However, it seems to be extremely widespread in C/C++ programming to a) install an SEH right at the beginning and b) deliberately raise exceptions for little errors that could have been avoided by simple error checking, or fixed in more benign ways.

Somebody will surely pop up with a good theory why computer science requires that there must be a mov fs:[0], esp at line 7 of the disassembly, but a) all of my BASIC compilers could live comfortably without SEH and b) my suspicion is that users of products of certain ~~respected~~ big software companies didn't like those ugly message boxes, so the marketing department asked for SEH... ;-)

Title: Re: C/C++ vs Assembler
Post by: Antariy on May 05, 2013, 09:59:26 PM

Jochen, now I agree with you totally :biggrin:

Title: Re: C/C++ vs Assembler
Post by: Manos on May 05, 2013, 10:09:13 PM

I tested three functions that do the same work, using MASM.

1). The Win API lstrlen
2). Hutch's library StrLen
3). The strlen of MSVCRT.LIB that I have on my system (WinXP).

The source and the results follow:

includelib \MSVCRT.LIB
strlen PROTO C pstr:DWORD

LOCAL dwTime:DWORD

    invoke GetTickCount
    mov dwTime, eax
    push esi
    xor esi, esi
TestLoop:
;   invoke lstrlen, addr szText
;   invoke StrLen, addr szText
    invoke strlen, addr szText
    inc esi
    cmp esi, 10000000
    jb TestLoop

    pop esi

    invoke GetTickCount
    sub eax, dwTime
    PrintDec eax

Results:
lstrlen 328 ticks
StrLen 94 ticks
strlen 94 ticks

Manos.

Title: Re: C/C++ vs Assembler
Post by: jj2007 on May 05, 2013, 10:23:48 PM

Ok, once more...

Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
loop overhead is approx. 17/10 cycles

2510 cycles for 10 * lstrlen
972 cycles for 10 * StrLen
1450 cycles for 10 * crt_strlen
371 cycles for 10 * Len

2512 cycles for 10 * lstrlen
972 cycles for 10 * StrLen
1515 cycles for 10 * crt_strlen
371 cycles for 10 * Len

2510 cycles for 10 * lstrlen
971 cycles for 10 * StrLen
1405 cycles for 10 * crt_strlen
371 cycles for 10 * Len

Title: Re: C/C++ vs Assembler
Post by: Gunther on May 05, 2013, 10:34:28 PM

Jochen,

you gave the right answer. :t All these HLL vs Assembly discussions (I've seen a lot over the years) are a bit fruitless. Here are my results:

Quote
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
loop overhead is approx. 28/10 cycles

1940   cycles for 10 * lstrlen
544   cycles for 10 * StrLen
1367   cycles for 10 * crt_strlen
169   cycles for 10 * Len

1987   cycles for 10 * lstrlen
570   cycles for 10 * StrLen
784   cycles for 10 * crt_strlen
172   cycles for 10 * Len

2558   cycles for 10 * lstrlen
557   cycles for 10 * StrLen
1367   cycles for 10 * crt_strlen
174   cycles for 10 * Len

100   = eax lstrlen
100   = eax StrLen
100   = eax crt_strlen
100   = eax Len

--- ok ---

Gunther

Title: Re: C/C++ vs Assembler
Post by: Manos on May 05, 2013, 10:41:42 PM

Quote from: Gunther on May 05, 2013, 10:34:28 PM
Jochen,

you gave the right answer. :t All these HLL vs Assembly discussions (I've seen a lot over the years) are a bit fruitless. Here are my results:

--- ok ---

Which crt_strlen did you use?

Manos.

Title: Re: C/C++ vs Assembler
Post by: hutch-- on May 05, 2013, 10:51:10 PM

This is the version of "strlen" in my XP version of MSVCRT. It does not look like compiler generated code and it is about 2 years after Agner Fog designed his StrLen algo which is in the MASM32 library.

strlen:
mov ecx, [esp+4]
test ecx, 3
jz lbl1

lbl0:
mov al, [ecx]
inc ecx
test al, al
jz lbl2
test ecx, 3
jnz lbl0
add eax, 0

lbl1:
mov eax, [ecx]
mov edx, 7EFEFEFFh
add edx, eax
xor eax, 0FFFFFFFFh
xor eax, edx
add ecx, 4
test eax, 81010100h
jz lbl1
mov eax, [ecx-4]
test al, al
jz lbl5
test ah, ah
jz lbl4
test eax, 0FF0000h
jz lbl3
test eax, 0FF000000h
jz lbl2
jmp lbl1

lbl2:
lea eax, [ecx-1]
mov ecx, [esp+4]
sub eax, ecx
ret

lbl3:
lea eax, [ecx-2]
mov ecx, [esp+4]
sub eax, ecx
ret

lbl4:
lea eax, [ecx-3]
mov ecx, [esp+4]
sub eax, ecx
ret

lbl5:
lea eax, [ecx-4]
mov ecx, [esp+4]
sub eax, ecx
ret

Title: Re: C/C++ vs Assembler
Post by: Gunther on May 06, 2013, 12:16:47 AM

Manos,

Quote from: Manos on May 05, 2013, 10:41:42 PM
Which crt_strlen did you use?

Manos.

Jochen included the source. I had to repeat the tests under my 32-bit XP in VirtualPC. Here are the results:

Quote
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
+19 of 20 tests valid, loop overhead is approx. 37/10 cycles
1855   cycles for 10 * lstrlen
548   cycles for 10 * StrLen
608   cycles for 10 * crt_strlen
165   cycles for 10 * Len

2492   cycles for 10 * lstrlen
555   cycles for 10 * StrLen
594   cycles for 10 * crt_strlen
164   cycles for 10 * Len

2483   cycles for 10 * lstrlen
547   cycles for 10 * StrLen
606   cycles for 10 * crt_strlen
452   cycles for 10 * Len

100   = eax lstrlen
100   = eax StrLen
100   = eax crt_strlen
100   = eax Len

--- ok ---

Gunther

Title: Re: C/C++ vs Assembler
Post by: Manos on May 06, 2013, 01:08:34 AM

Quote from: hutch-- on May 05, 2013, 10:51:10 PM
This is the version of "strlen" in my XP version of MSVCRT. It does not look like compiler generated code and it is about 2 years after Agner Fog designed his StrLen algo which is in the MASM32 library.

Steve,
where is your strlen version?
I searched for it in your library today, before making my last post for testing, but did not find it.

The following is the MS crt version for Intel:

page ,132
title strlen - return the length of a null-terminated string
;***
;strlen.asm - contains strlen() routine
;
; Copyright (c) 1985-1997, Microsoft Corporation. All rights reserved.
;
;Purpose:
; strlen returns the length of a null-terminated string,
; not including the null byte itself.
;
;*******************************************************************************

.xlist
include cruntime.inc
.list

page
;***
;strlen - return the length of a null-terminated string
;
;Purpose:
; Finds the length in bytes of the given string, not including
; the final null character.
;
; Algorithm:
; int strlen (const char * str)
; {
; int length = 0;
;
; while( *str++ )
; ++length;
;
; return( length );
; }
;
;Entry:
; const char * str - string whose length is to be computed
;
;Exit:
; EAX = length of the string "str", exclusive of the final null byte
;
;Uses:
; EAX, ECX, EDX
;
;Exceptions:
;
;*******************************************************************************

CODESEG

public strlen

strlen proc

.FPO ( 0, 1, 0, 0, 0, 0 )

string equ [esp + 4]

mov ecx,string ; ecx -> string
test ecx,3 ; test if string is aligned on 32 bits
je short main_loop

str_misaligned:
; simple byte loop until string is aligned
mov al,byte ptr [ecx]
inc ecx
test al,al
je short byte_3
test ecx,3
jne short str_misaligned

add eax,dword ptr 0 ; 5 byte nop to align label below

align 16 ; should be redundant

main_loop:
mov eax,dword ptr [ecx] ; read 4 bytes
mov edx,7efefeffh
add edx,eax
xor eax,-1
xor eax,edx
add ecx,4
test eax,81010100h
je short main_loop
; found zero byte in the loop
mov eax,[ecx - 4]
test al,al ; is it byte 0
je short byte_0
test ah,ah ; is it byte 1
je short byte_1
test eax,00ff0000h ; is it byte 2
je short byte_2
test eax,0ff000000h ; is it byte 3
je short byte_3
jmp short main_loop ; taken if bits 24-30 are clear and bit
; 31 is set

byte_3:
lea eax,[ecx - 1]
mov ecx,string
sub eax,ecx
ret
byte_2:
lea eax,[ecx - 2]
mov ecx,string
sub eax,ecx
ret
byte_1:
lea eax,[ecx - 3]
mov ecx,string
sub eax,ecx
ret
byte_0:
lea eax,[ecx - 4]
mov ecx,string
sub eax,ecx
ret

strlen endp

end
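
The main_loop test in the listing above can be hard to follow in assembly, so here is a C sketch of the same idea. The function names are illustrative and not the CRT's. It assumes, as the assembly does, that reading the whole aligned 4-byte word that contains the terminator is safe.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* C rendering of the main_loop test.
   Returns nonzero when the 32-bit word x MAY contain a zero byte. */
static int may_have_zero_byte(uint32_t x)
{
    uint32_t sum = x + 0x7EFEFEFFu;           /* add edx,eax */
    return ((~x ^ sum) & 0x81010100u) != 0;   /* xor eax,-1 / xor eax,edx / test */
}

/* Word-at-a-time string length, following the structure of the listing. */
size_t strlen_by_words(const char *s)
{
    const char *p = s;

    /* byte loop until p is 4-byte aligned (str_misaligned) */
    while (((uintptr_t)p & 3u) != 0) {
        if (*p == '\0')
            return (size_t)(p - s);
        ++p;
    }

    for (;;) {
        uint32_t w;
        memcpy(&w, p, sizeof w);              /* read 4 bytes */
        if (may_have_zero_byte(w)) {
            /* confirm byte by byte; the quick test can give a false positive
               when bits 24-30 are clear and bit 31 is set */
            for (int i = 0; i < 4; ++i)
                if (p[i] == '\0')
                    return (size_t)(p - s) + (size_t)i;
        }
        p += 4;
    }
}
```

Adding 7EFEFEFFh makes every nonzero byte carry into the next "hole" bit (bits 8, 16, 24, and 31). A zero byte produces no carry, so its hole bit stays unchanged, and that is what the mask 81010100h detects.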

Manos.

P.S.
Below my name on the left of your forum writes:
Manos
New Member.

But I am one of the first members since 2004.
It would be better to write:
Old Member !!!

Title: Re: C/C++ vs Assembler
Post by: hutch-- on May 06, 2013, 01:14:07 AM

Manos,

Same place it's always been, strlen.asm in the m32lib directory. It is the version written in 1996 by Agner Fog.

Title: Re: C/C++ vs Assembler
Post by: jj2007 on May 06, 2013, 02:52:17 AM

Quote from: hutch-- on May 05, 2013, 10:51:10 PM
This is the version of "strlen" in my XP version of MSVCRT.

OK, included as crt_strlen2 (identical with Manos' "MS crt version for Intel"):
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
loop overhead is approx. 17/10 cycles

2677 cycles for 10 * lstrlen
973 cycles for 10 * StrLen
1484 cycles for 10 * crt_strlen
372 cycles for 10 * Len
1348 cycles for 10 * crt_strlen2

2682 cycles for 10 * lstrlen
973 cycles for 10 * StrLen
1464 cycles for 10 * crt_strlen
372 cycles for 10 * Len
1346 cycles for 10 * crt_strlen2

Title: Re: C/C++ vs Assembler
Post by: Gunther on May 06, 2013, 03:29:06 AM

Manos,

Quote from: Manos on May 06, 2013, 01:08:34 AM
P.S.
Below my name on the left of your forum writes:
Manos
New Member.

But I am one of the first members since 2004.
It would be better to write:
Old Member !!!

that's right, but it has to do with the number of posts you've made. It's the forum software.

Gunther

Title: Re: C/C++ vs Assembler
Post by: Manos on May 06, 2013, 04:22:14 AM

Quote from: Gunther on May 06, 2013, 03:29:06 AM
that's right, but it has to do with the number of posts you've made. It's the forum software.

Gunther

Yes, you are right, but in some forums the administrator can change this characteristic.
For example, in my forum I have this in the Admin control panel:

Manage groups
From this panel you can administer all your usergroups. You can delete, create and edit existing groups. Furthermore, you may choose group leaders, toggle open/hidden/closed group status and set the group name and description.

User defined groups
These are groups created by you or another admin on this board. You can manage memberships as well as edit group properties or even delete the group.

Manos.

Title: Re: C/C++ vs Assembler
Post by: Gunther on May 06, 2013, 07:49:46 AM

Manos,

Quote from: Manos on May 06, 2013, 04:22:14 AM
Yes, you are right, but in some forums the administrator can change this characteristic.

that's clear. On the other hand, writing some more posts and being active in the forum isn't so hard. We're a very lively forum, but it's always good to have experienced and hard working coders like you on the side. :t

Gunther

Title: Re: C/C++ vs Assembler
Post by: Antariy on May 06, 2013, 01:23:07 PM

OK, here is a test for more or less sophisticated integer code (some of us remember this piece :biggrin:)

The C code is in the Axhex2dw.c file; it is linked with an ASM source that contains the timing testbed:

INT_PTR __stdcall Axhex2dw_C(char* ptc){
	INT_PTR result=0;
	while(*ptc){
		result=((result<<4)+(*ptc&0xF)+((*ptc>>6)*9));
		++ptc;
	}
	return result;
}

The C code was built with MSVC10 with different optimization settings; the .OBJ file included in the archive is the one built with maximum optimization for performance.
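
To see why the one-line expression in the C function works for ASCII hex digits, it helps to split out the per-digit step. The sketch below is only an illustration: `hex_digit_value` and `hex_to_u32` are made-up names, and `uint32_t` replaces `INT_PTR` so that the arithmetic is unsigned. Like the original, it does no input validation.

```c
#include <stdint.h>

/* Value of one ASCII hex digit, with no validation:
   '0'..'9' (0x30..0x39): low nibble 0..9, c >> 6 == 0  -> 0..9
   'A'..'F' (0x41..0x46): low nibble 1..6, c >> 6 == 1  -> 10..15
   'a'..'f' (0x61..0x66): low nibble 1..6, c >> 6 == 1  -> 10..15 */
unsigned hex_digit_value(char ch)
{
    unsigned c = (unsigned char)ch;
    return (c & 0xFu) + (c >> 6) * 9u;
}

/* Same loop shape as Axhex2dw_C: one iteration per digit. */
uint32_t hex_to_u32(const char *s)
{
    uint32_t result = 0;
    while (*s) {
        result = (result << 4) + hex_digit_value(*s);
        ++s;
    }
    return result;
}
```

The multiplication by 9 is the part an optimizing compiler can turn into a single LEA of the form reg+reg*8.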

OK, here is the disassembly for the C code optimized to be small:

_Axhex2dw_C@4:
  00000000: 8B 54 24 04        mov         edx,dword ptr [esp+4]
  00000004: 8A 0A              mov         cl,byte ptr [edx]
  00000006: 33 C0              xor         eax,eax
  00000008: 84 C9              test        cl,cl
  0000000A: 74 1E              je          0000002A
  0000000C: 56                 push        esi
  0000000D: 0F BE C9           movsx       ecx,cl
  00000010: 8B F1              mov         esi,ecx
  00000012: C1 FE 06           sar         esi,6
  00000015: 6B F6 09           imul        esi,esi,9
  00000018: 83 E1 0F           and         ecx,0Fh
  0000001B: 03 F1              add         esi,ecx
  0000001D: C1 E0 04           shl         eax,4
  00000020: 03 C6              add         eax,esi
  00000022: 42                 inc         edx
  00000023: 8A 0A              mov         cl,byte ptr [edx]
  00000025: 84 C9              test        cl,cl
  00000027: 75 E4              jne         0000000D
  00000029: 5E                 pop         esi
  0000002A: C2 04 00           ret         4

45 bytes long.

The timings:

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)

43      cycles for Small 1
44      cycles for Small 2
43      cycles for Small 3
46      cycles for Small 3.1
43      cycles for Small 4
100     cycles for C version
51      cycles for Small 1
44      cycles for Small 2
43      cycles for Small 3
47      cycles for Small 3.1
43      cycles for Small 4
99      cycles for C version
43      cycles for Small 1
45      cycles for Small 2
43      cycles for Small 3
46      cycles for Small 3.1
43      cycles for Small 4
99      cycles for C version

--- ok ---

Axhex2dw1 (the "Small 1") is 69 bytes long, Axhex2dw2 (the "Small 2") is 48 bytes long.

OK, now the test with maximum performance optimization for C code.

The disassembly:

_Axhex2dw_C@4:
  00000000: 56                 push        esi
  00000001: 8B 74 24 08        mov         esi,dword ptr [esp+8]
  00000005: 8A 0E              mov         cl,byte ptr [esi]
  00000007: 33 C0              xor         eax,eax
  00000009: 84 C9              test        cl,cl
  0000000B: 74 20              je          0000002D
  0000000D: 8D 49 00           lea         ecx,[ecx]
  00000010: 0F BE C9           movsx       ecx,cl
  00000013: 8B D1              mov         edx,ecx
  00000015: C1 FA 06           sar         edx,6
  00000018: 83 E1 0F           and         ecx,0Fh
  0000001B: 8D 14 D2           lea         edx,[edx+edx*8]
  0000001E: 03 D1              add         edx,ecx
  00000020: 8A 4E 01           mov         cl,byte ptr [esi+1]
  00000023: 46                 inc         esi
  00000024: C1 E0 04           shl         eax,4
  00000027: 03 C2              add         eax,edx
  00000029: 84 C9              test        cl,cl
  0000002B: 75 E3              jne         00000010
  0000002D: 5E                 pop         esi
  0000002E: C2 04 00           ret         4

49 bytes long. You can see that the compiler used LEA to multiply EDX by 9. The algo in C was intentionally written in such a way that an optimizing compiler would produce code that, at first look, is similar to the handwritten code, i.e. it was already speed-optimized in the HLL. The inner loop logic is the same as in the handwritten code, so this time the timings for the C code should probably be very good, but...

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)

46      cycles for Small 1
45      cycles for Small 2
43      cycles for Small 3
47      cycles for Small 3.1
51      cycles for Small 4
79      cycles for C version
43      cycles for Small 1
44      cycles for Small 2
43      cycles for Small 3
50      cycles for Small 3.1
43      cycles for Small 4
86      cycles for C version
43      cycles for Small 1
45      cycles for Small 2
43      cycles for Small 3
46      cycles for Small 3.1
43      cycles for Small 4
79      cycles for C version

--- ok ---

The logic is the same (of course, the algo is the same), but the implementation is different. So, like I said in my first message in the thread, sophisticated algos (even this one, which is not really too sophisticated) are not something that even an optimizing compiler handles best. Also, some things make the algo slower: grabbing a char and then promoting it to a dword with two instructions instead of one, the compiler's strange fear of reading the same byte twice from a memory reference, etc. Well, for a program, the compiler "writes" very good code, we should agree with that :biggrin: There is really big and hard work by people behind the short description "Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.30319.01 for 80x86" :eusa_clap:

It will be very interesting to see the results on different machines, as this time this is also the test "for which machines does Microsoft optimize their compilers?" - i.e. on which machines VC's code performs better - so we can say "on this machine Windows (Word/Photoshop/etc) works faster than on that (with equal CPU freq)! Just because it uses MSVC" :lol:

Title: Re: C/C++ vs Assembler
Post by: hutch-- on May 06, 2013, 01:43:35 PM

You don't have to be a genius to know that this forum is DIFFERENT to the last one that was hosted in the UK. After a lot of work I have made an archived version of the old forum available, but everybody who is a member of this forum started with a zero post count, me included as the administrator. I did not write the software, nor do I care who likes it or not; it does the job and it is not going to be modified to suit a quirk of a few folks who have not done the work to build a new forum and archive the old one.

The advice has already been given by more active members, make some more posts.

Title: Re: C/C++ vs Assembler
Post by: jj2007 on May 06, 2013, 02:44:17 PM

Quote from: Antariy on May 06, 2013, 01:23:07 PM
It will be very interesting to see the results on different machines

Hi Alex,
On paper we have the same machine but results are different:

Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
41 cycles for Small 1
38 cycles for Small 2
41 cycles for Small 3
41 cycles for Small 3.1
41 cycles for Small 4
75 cycles for C version
41 cycles for Small 1
49 cycles for Small 2
41 cycles for Small 3
41 cycles for Small 3.1
41 cycles for Small 4
50 cycles for C version
41 cycles for Small 1
38 cycles for Small 2
53 cycles for Small 3
41 cycles for Small 3.1
41 cycles for Small 4
51 cycles for C version

The 75 cycles peak is not an accident, it's there for every run I tried.

AMD Athlon(tm) Dual Core Processor 4450B (SSE3)

39 cycles for Small 1
42 cycles for Small 2
39 cycles for Small 3
39 cycles for Small 3.1
39 cycles for Small 4
44 cycles for C version

Title: Re: C/C++ vs Assembler
Post by: habran on May 06, 2013, 02:55:35 PM

Intel(R) Core(TM) i7-3610QM CPU @ 2.30GHz (SSE4)

15      cycles for Small 1
16      cycles for Small 2
16      cycles for Small 3
39      cycles for Small 3.1
14      cycles for Small 4
26      cycles for C version
17      cycles for Small 1
16      cycles for Small 2
16      cycles for Small 3
16      cycles for Small 3.1
19      cycles for Small 4
26      cycles for C version
16      cycles for Small 1
17      cycles for Small 2
18      cycles for Small 3
19      cycles for Small 3.1
18      cycles for Small 4
26      cycles for C version

--- ok ---

Title: Re: C/C++ vs Assembler
Post by: sinsi on May 06, 2013, 03:12:23 PM

I wonder how much influence the OS has in these timings too, like whether running 32-bit code on a 64-bit OS has a penalty?

Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz (SSE4)

22 cycles for Small 1
23 cycles for Small 2
24 cycles for Small 3
24 cycles for Small 3.1
24 cycles for Small 4
28 cycles for C version
23 cycles for Small 1
24 cycles for Small 2
20 cycles for Small 3
19 cycles for Small 3.1
19 cycles for Small 4
34 cycles for C version
20 cycles for Small 1
20 cycles for Small 2
20 cycles for Small 3
20 cycles for Small 3.1
20 cycles for Small 4
33 cycles for C version

Title: Re: C/C++ vs Assembler
Post by: Antariy on May 06, 2013, 03:13:13 PM

Hi, Jochen :biggrin:

I think it may be because of the first-time pass: the code cache probably was not purged, so in the later tests of the same run your CPU already has its "food" ready. It seems that the first time, code with superfluous instructions and poorly arranged logic takes much longer to be decoded and executed than on later passes, when it is simply executed from the code cache. But, speaking of a "real life"(tm) app, this means that this code is crazily suboptimal (since in a real app it will not be called millions of times consecutively, the best code is the code that also gets decoded faster... but, again, what do +/- 30 clocks mean if the proc is called, let's say, once a second... that's the true reason why compilers are so popular for "general programming"... but, again, the entire proggie consists of these small bits, and if every one of them is a bit faster, the entire proggie will be faster... well, philosophy is starting here :greensml:)

It's also interesting how the different tweaks of the assembly versions of Axhex2dw perform. "Small 1" (you remember it, of course :biggrin:) still looks more or less stable in the timings. And, even though it is longer than the C version in terms of code size, it is still faster.

Hi, habran, thanks for testing it :t

Hi, John, thank you, too :biggrin: As for influence of the OS - I think you're right, and switching to a 32 bit execution context has a penalty under 64 bit OS, like it is for 16 bit apps under 32 bit OS (but since my x64 experience is small - I cannot say it 100%).

Title: Re: C/C++ vs Assembler
Post by: hutch-- on May 06, 2013, 05:25:12 PM

There is an old rule that occasionally applies: if you get a good enough algorithm, then the coding does not matter as much. I have an example in mind, a hybrid sort by Robert Sedgewick, originally written in C, and it was genuinely fast. I used it as an algorithm to test out a tool I was designing: I converted it directly to unoptimised assembler, removed the stack frame, dropped the instruction count, inlined all of the satellite functions, and it would not go faster than the C original.

I got a shorter, better optimised instruction sequence and it stubbornly refused to go faster. It does not happen all that often but it was interesting to see. Basically a perfect algorithm that was highly insensitive to coding technique.

Title: Re: C/C++ vs Assembler
Post by: jj2007 on May 06, 2013, 06:05:03 PM

Quote from: Antariy on May 06, 2013, 01:23:07 PM
OK, now the test with maximum performance optimization for C code.

The disassembly:

_Axhex2dw_C@4:
  00000000: 56                 push        esi
  00000001: 8B 74 24 08        mov         esi,dword ptr [esp+8]
  00000005: 8A 0E              mov         cl,byte ptr [esi]
  00000007: 33 C0              xor         eax,eax
  00000009: 84 C9              test        cl,cl
  0000000B: 74 20              je          0000002D
  0000000D: 8D 49 00           lea         ecx,[ecx]
  00000010: 0F BE C9           movsx       ecx,cl
  00000013: 8B D1              mov         edx,ecx
  00000015: C1 FA 06           sar         edx,6
  00000018: 83 E1 0F           and         ecx,0Fh
  0000001B: 8D 14 D2           lea         edx,[edx+edx*8]
  0000001E: 03 D1              add         edx,ecx
  00000020: 8A 4E 01           mov         cl,byte ptr [esi+1]
  00000023: 46                 inc         esi
  00000024: C1 E0 04           shl         eax,4
  00000027: 03 C2              add         eax,edx
  00000029: 84 C9              test        cl,cl
  0000002B: 75 E3              jne         00000010
  0000002D: 5E                 pop         esi
  0000002E: C2 04 00           ret         4

49 bytes long.

Hi Alex,

I invested "considerable knowledge, skill, and half an hour of my precious time, to get a near-optimal solution in assembly language".

AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
39 cycles for Small 1
42 cycles for Small 2
39 cycles for Small 3
39 cycles for Small 3.1
39 cycles for Small 4
46 cycles for C version
38 cycles for C mod JJ

I hope 17% improvement on the optimising C compiler qualifies as a "near-optimal solution" ;-)

Axhex2dw_CJ proc src   ; old C original modified by JJ
   push esi
   mov esi, dword ptr [esp+8]
   movsx ecx, byte ptr [esi]
   xor eax, eax
   jmp @go
;    test ecx, ecx
;    je bye
@@:   ; movsx ecx, cl   ; now superfluous
   mov edx, ecx
   sar edx, 6
   and ecx, 0Fh
   lea edx, [edx+edx*8]
   add edx, ecx
   movsx ecx, byte ptr [esi+1]
   shl eax, 4
   inc esi
   add eax, edx
@go:   test ecx, ecx
   jne @B
bye:   pop esi
   ret 4
Axhex2dw_CJ endp

Title: Re: C/C++ vs Assembler
Post by: Manos on May 06, 2013, 06:49:02 PM

Quote from: hutch-- on May 06, 2013, 01:43:35 PM
I did not write the software, nor do I care who likes it or not; it does the job and it is not going to be modified to suit a quirk of a few folks who have not done the work to build a new forum and archive the old one.

The advice has already been given by more active members, make some more posts.

I know how forum software works, and I know how to program forum software with VBasic and Java scripts.

When I wrote:
P.S.
Below my name on the left of your forum writes:
Manos
New Member.

But I am one of the first members since 2004.
It would be better to write:
Old Member !!!,

I was just making a joke.
But some people did not understand the spirit of my words.
If I had the self-exaltation to see my name with stars,
I could post in this forum good morning, good afternoon and good night every day.

Manos.

P.S.
Your previous forum in the U.K. was much faster.

Title: Re: C/C++ vs Assembler
Post by: anta40 on May 06, 2013, 07:24:03 PM

Quote from: hutch-- on May 06, 2013, 05:25:12 PM
I have an example in mind, a hybrid sort by Robert Sedgewick, originally written in C, and it was genuinely fast.

Is the code listing available somewhere on the internet?
Or is it available in his book, i.e. Algorithms in C?

Title: Re: C/C++ vs Assembler
Post by: Antariy on May 06, 2013, 07:28:52 PM

Hi Jochen :t

In that post I also noted that at "first look" the compiler does a good job, but some small bits seem to come from ancient assumptions in its optimization techniques. Probably, like Hutch said, this is just a case where the algo is at the edge of its performance - well, its creation happened in front of your eyes, you remember :biggrin: So, this is a very good example of Human vs compiler: the same algo => the same inner loop logic in HLL and ASM => the same inner loop code in the compiler-generated code => and STILL the Human can improve the compiler's work for a particular task and/or hardware. And you have just demonstrated it :t

For my CPU the timings almost do not change (incredible!) - I ran it multiple times; these timings are the smallest:

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)

43      cycles for Small 1
44      cycles for Small 2
43      cycles for Small 3
46      cycles for Small 3.1
43      cycles for Small 4
79      cycles for C version
79      cycles for C mod JJ

43      cycles for Small 1
44      cycles for Small 2
37      cycles for Small 3
47      cycles for Small 3.1
46      cycles for Small 4
79      cycles for C version
72      cycles for C mod JJ

43      cycles for Small 1
44      cycles for Small 2
43      cycles for Small 3
46      cycles for Small 3.1
46      cycles for Small 4
79      cycles for C version
79      cycles for C mod JJ


48       bytes for Axhex2dw_C2
43       bytes for Axhex2dw_CJ
ABCDEF01        returned
--- ok ---

But your CPU obviously does not like the longer and superfluous code that is generated by the compiler; it still likes the Human's code :biggrin: It will be interesting to see how more modern CPUs run it.

Title: Re: C/C++ vs Assembler
Post by: habran on May 06, 2013, 08:06:40 PM

Intel(R) Core(TM) i7-3610QM CPU @ 2.30GHz (SSE4)

19 cycles for Small 1
10 cycles for Small 2
39 cycles for Small 3
38 cycles for Small 3.1
19 cycles for Small 4
26 cycles for C version
19 cycles for C mod JJ

17 cycles for Small 1
17 cycles for Small 2
18 cycles for Small 3
18 cycles for Small 3.1
18 cycles for Small 4
26 cycles for C version
20 cycles for C mod JJ

16 cycles for Small 1
19 cycles for Small 2
19 cycles for Small 3
18 cycles for Small 3.1
18 cycles for Small 4
25 cycles for C version
19 cycles for C mod JJ

48 bytes for Axhex2dw_C2
43 bytes for Axhex2dw_CJ
ABCDEF01 returned
--- ok ---

Title: Re: C/C++ vs Assembler
Post by: qWord on May 06, 2013, 09:33:07 PM

Behold and see...

Intel(R) Core(TM) i7-3610QM CPU @ 2.30GHz (SSE4)

17      cycles for Small 1
18      cycles for Small 2
19      cycles for Small 3
18      cycles for Small 3.1
19      cycles for Small 4
25      cycles for C version
19      cycles for C mod JJ
18      hsz2dw (Microsoft 32-Bit C/C++ optimization compiler v16.00.40219.01)
11      hsz2dw2 (unrolled 4 times)

17      cycles for Small 1
23      cycles for Small 2
16      cycles for Small 3
17      cycles for Small 3.1
23      cycles for Small 4
24      cycles for C version
19      cycles for C mod JJ
16      hsz2dw (Microsoft 32-Bit C/C++ optimization compiler v16.00.40219.01)
13      hsz2dw2 (unrolled 4 times)

20      cycles for Small 1
15      cycles for Small 2
19      cycles for Small 3
19      cycles for Small 3.1
19      cycles for Small 4
26      cycles for C version
19      cycles for C mod JJ
16      hsz2dw (Microsoft 32-Bit C/C++ optimization compiler v16.00.40219.01)
13      hsz2dw2 (unrolled 4 times)


48       bytes for Axhex2dw_C2
43       bytes for Axhex2dw_CJ
ABCDEF01        returned
--- ok ---

unsigned int hsz2dw2(char* psz){
	const static unsigned char lut[256] = { 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
										   ,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,2,3,4,5,6,7,8,9,0,0,0,0,0,0,0
										   ,10,11,12,13,14,15,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
										   ,10,11,12,13,14,15,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
										   ,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
										   ,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
										   ,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
										   ,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 };
	
	register unsigned int c;
	register unsigned int r=0;
	register unsigned char* p = (unsigned char*) psz;
	while(1) {
		if(!(c=*p))
			break;
		r<<=4;
		r+=lut[c];
		p++;

		if(!(c=*p))
			break;
		r<<=4;
		r+=lut[c];
		p++;

		if(!(c=*p))
			break;
		r<<=4;
		r+=lut[c];
		p++;

		if(!(c=*p))
			break;
		r<<=4;
		r+=lut[c];
		p++;
	}
	return r;
}
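
The 256-entry table can also be filled at startup instead of being typed by hand, which makes the mapping easier to check. The following is a minimal sketch, assuming the same behaviour as hsz2dw2 (characters that are not hex digits count as 0). The names `hex_lut`, `init_hex_lut`, and `hex_to_u32_lut` are illustrative, and the loop is left rolled for brevity.

```c
#include <stdint.h>

static unsigned char hex_lut[256];   /* zero-initialised: non-hex characters map to 0 */

/* Fill the table once, before the first conversion. */
void init_hex_lut(void)
{
    for (int i = 0; i < 10; ++i)
        hex_lut['0' + i] = (unsigned char)i;
    for (int i = 0; i < 6; ++i) {
        hex_lut['A' + i] = (unsigned char)(10 + i);
        hex_lut['a' + i] = (unsigned char)(10 + i);
    }
}

/* One table lookup per digit; no arithmetic on the character itself. */
uint32_t hex_to_u32_lut(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    uint32_t r = 0;
    while (*p) {
        r = (r << 4) + hex_lut[*p];
        ++p;
    }
    return r;
}
```

Note that the table and the arithmetic version differ on invalid input: with the table a stray 'g' contributes 0, while the expression `(c & 0xF) + (c >> 6) * 9` yields 16 for it.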

:biggrin:

Title: Re: C/C++ vs Assembler
Post by: jj2007 on May 06, 2013, 11:05:26 PM

Refuses to optimise for AMD ;-)

AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
39 cycles for Small 1
42 cycles for Small 2
39 cycles for Small 3
39 cycles for Small 3.1
39 cycles for Small 4
46 cycles for C version
38 cycles for C mod JJ
47 hsz2dw (Microsoft 32-Bit C/C++ optimization compiler v16.00.40219.01)
43 hsz2dw2 (unrolled 4 times)

Title: Re: C/C++ vs Assembler
Post by: hutch-- on May 06, 2013, 11:07:31 PM

:biggrin:

> I know how forum software works, and I know how to program forum software with VBasic and Java scripts.

You would be surprised just how badly it would run on this 64 bit Unix server. The forum is written in PHP, not VBscript and JAVA, and NO, the forum software will not be modified.

Title: Re: C/C++ vs Assembler
Post by: FORTRANS on May 06, 2013, 11:09:58 PM

Hi,

Three more data points for you.

Cheers,

 (SSE1)

61	cycles for Small 1
57	cycles for Small 2
59	cycles for Small 3
60	cycles for Small 3.1
60	cycles for Small 4
71	cycles for C version
65	cycles for C mod JJ
57	hsz2dw (Microsoft 32-Bit C/C++ optimization compiler v16.00.40219.01)
28	hsz2dw2 (unrolled 4 times)

60	cycles for Small 1
59	cycles for Small 2
59	cycles for Small 3
60	cycles for Small 3.1
62	cycles for Small 4
71	cycles for C version
63	cycles for C mod JJ
55	hsz2dw (Microsoft 32-Bit C/C++ optimization compiler v16.00.40219.01)
29	hsz2dw2 (unrolled 4 times)

60	cycles for Small 1
60	cycles for Small 2
59	cycles for Small 3
60	cycles for Small 3.1
60	cycles for Small 4
71	cycles for C version
63	cycles for C mod JJ
55	hsz2dw (Microsoft 32-Bit C/C++ optimization compiler v16.00.40219.01)
28	hsz2dw2 (unrolled 4 times)


48	 bytes for Axhex2dw_C2
43	 bytes for Axhex2dw_CJ
ABCDEF01	returned
--- ok ---

120	cycles for Small 1
143	cycles for Small 2
125	cycles for Small 3
132	cycles for Small 3.1
131	cycles for Small 4
117	cycles for C version
107	cycles for C mod JJ
101	hsz2dw (Microsoft 32-Bit C/C++ optimization compiler v16.00.40219.01)
100	hsz2dw2 (unrolled 4 times)

119	cycles for Small 1
141	cycles for Small 2
136	cycles for Small 3
125	cycles for Small 3.1
123	cycles for Small 4
117	cycles for C version
106	cycles for C mod JJ
102	hsz2dw (Microsoft 32-Bit C/C++ optimization compiler v16.00.40219.01)
100	hsz2dw2 (unrolled 4 times)

117	cycles for Small 1
139	cycles for Small 2
130	cycles for Small 3
125	cycles for Small 3.1
127	cycles for Small 4
117	cycles for C version
106	cycles for C mod JJ
102	hsz2dw (Microsoft 32-Bit C/C++ optimization compiler v16.00.40219.01)
100	hsz2dw2 (unrolled 4 times)


48	 bytes for Axhex2dw_C2
43	 bytes for Axhex2dw_CJ
ABCDEF01	returned
--- ok ---

Intel(R) Pentium(R) M processor 1.70GHz (SSE2)

41	cycles for Small 1
38	cycles for Small 2
45	cycles for Small 3
41	cycles for Small 3.1
41	cycles for Small 4
51	cycles for C version
56	cycles for C mod JJ
31	hsz2dw (Microsoft 32-Bit C/C++ optimization compiler v16.00.40219.01)
24	hsz2dw2 (unrolled 4 times)

41	cycles for Small 1
38	cycles for Small 2
41	cycles for Small 3
41	cycles for Small 3.1
41	cycles for Small 4
51	cycles for C version
50	cycles for C mod JJ
31	hsz2dw (Microsoft 32-Bit C/C++ optimization compiler v16.00.40219.01)
24	hsz2dw2 (unrolled 4 times)

41	cycles for Small 1
38	cycles for Small 2
41	cycles for Small 3
41	cycles for Small 3.1
41	cycles for Small 4
51	cycles for C version
50	cycles for C mod JJ
31	hsz2dw (Microsoft 32-Bit C/C++ optimization compiler v16.00.40219.01)
24	hsz2dw2 (unrolled 4 times)


48	 bytes for Axhex2dw_C2
43	 bytes for Axhex2dw_CJ
ABCDEF01	returned
--- ok ---

Title: Re: C/C++ vs Assembler
Post by: Antariy on May 06, 2013, 11:19:20 PM

Hi qWord :t

Yeah, unrolling it is the way to make it faster, but the tested algo is not unrolled - that's the point. It's "classic" more or less small, "looped" code, these characteristics are intentional - that was not a contest but rather a test :biggrin:

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)

43      cycles for Small 1
45      cycles for Small 2
43      cycles for Small 3
47      cycles for Small 3.1
43      cycles for Small 4
79      cycles for C version
79      cycles for C mod JJ
40      hsz2dw (Microsoft 32-Bit C/C++ optimization compiler v16.00.40219.01)
31      hsz2dw2 (unrolled 4 times)

43      cycles for Small 1
45      cycles for Small 2
66      cycles for Small 3
47      cycles for Small 3.1
43      cycles for Small 4
79      cycles for C version
79      cycles for C mod JJ
37      hsz2dw (Microsoft 32-Bit C/C++ optimization compiler v16.00.40219.01)
31      hsz2dw2 (unrolled 4 times)

43      cycles for Small 1
45      cycles for Small 2
43      cycles for Small 3
47      cycles for Small 3.1
43      cycles for Small 4
79      cycles for C version
79      cycles for C mod JJ
37      hsz2dw (Microsoft 32-Bit C/C++ optimization compiler v16.00.40219.01)
31      hsz2dw2 (unrolled 4 times)


48       bytes for Axhex2dw_C2
43       bytes for Axhex2dw_CJ
ABCDEF01        returned
--- ok ---

Actually, the testbed I posted was a trimmed version of one I made some time ago... I will post it now - it contains the procs which were in the contest here earlier. I used Axhex2dw just because it is the fastest (at least so far) of the tested hex2dw procs with these characteristics: case insensitive, does not check input, and looped (i.e. one loop iteration for every digit - not unrolled at all). It also looks like it is copyrighted by me; at least no one has disputed the rights for ~3 years :lol:
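
To make those three characteristics concrete, here is a minimal C sketch of a looped, case-insensitive, unchecked hex-to-dword conversion. It is not the Axhex2dw source; the name hex2dw_looped and the bit trick used for the digit are assumptions chosen for illustration only.

```c
#include <stdint.h>

/* Hypothetical illustration, not the Axhex2dw code from the testbed.
   Looped: exactly one iteration per hex digit.
   Unchecked: any non-hex character silently produces garbage. */
static uint32_t hex2dw_looped(const char *s)
{
    uint32_t result = 0;
    while (*s) {
        uint32_t c = (unsigned char)*s++;
        /* Case-insensitive digit conversion without branches:
           '0'..'9' (0x30..0x39) have bit 6 clear, 'A'..'F' and 'a'..'f'
           have it set. Letters need +9 on the low nibble:
           'A' = 0x41 -> 1 + 9 = 10, 'f' = 0x66 -> 6 + 9 = 15. */
        uint32_t digit = (c & 0x0F) + ((c >> 6) & 1) * 9;
        result = (result << 4) | digit;
    }
    return result;
}
```

Calling hex2dw_looped("ABCDEF01") or hex2dw_looped("abcdef01") yields 0xABCDEF01, the same value the testbed prints as "returned".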

OK, here are the timings for the attached archive (it is the old testbed):

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)



25      cycles for Fast version
27      cycles for Fast version under AMD
43      cycles for Small 1
45      cycles for Small 2
43      cycles for Small 3
47      cycles for Small 3.1
43      cycles for Small 4
28      cycles for MMX 1
28      cycles for MMX 2
32      cycles for SSE1

Other's Versions:
48      cycles for Axhex2dw improved by Hutch (1)
83      cycles for Axhex2dw improved by Hutch (2)

28      cycles for Lingo's SSE version
24      cycles for Lingo's BIG integer version
23      cycles for Jochen's WORD-Indexed version
27      cycles for Dave's version (with minor changes)


25      cycles for Fast version
27      cycles for Fast version under AMD
43      cycles for Small 1
45      cycles for Small 2
43      cycles for Small 3
59      cycles for Small 3.1
43      cycles for Small 4
28      cycles for MMX 1
28      cycles for MMX 2
30      cycles for SSE1

Other's Versions:
48      cycles for Axhex2dw improved by Hutch (1)
83      cycles for Axhex2dw improved by Hutch (2)

28      cycles for Lingo's SSE version
24      cycles for Lingo's BIG integer version
23      cycles for Jochen's WORD-Indexed version
27      cycles for Dave's version (with minor changes)


25      cycles for Fast version
30      cycles for Fast version under AMD
43      cycles for Small 1
116     cycles for Small 2
43      cycles for Small 3
47      cycles for Small 3.1
43      cycles for Small 4
28      cycles for MMX 1
28      cycles for MMX 2
46      cycles for SSE1

Other's Versions:
48      cycles for Axhex2dw improved by Hutch (1)
83      cycles for Axhex2dw improved by Hutch (2)

28      cycles for Lingo's SSE version
24      cycles for Lingo's BIG integer version
23      cycles for Jochen's WORD-Indexed version
27      cycles for Dave's version (with minor changes)

==========
Codesizes:
Axhex2dw_Unrolled:      396
Axhex2dw_Unrolled_AMD:  396
Axhex2dw1 - 1:  69
Axhex2dw2 - 2:  48
Axhex2dw3 - 3:  57
Axhex2dw3_1 - 3.1:      56
Axhex2dw3 - 4:  61
Axhex2dw_MMX:   128
Axhex2dw_MMX2:  160
Axhex2dw_SSE:   160
Alex_Short_Hutch:       59
Axhex2dw_Hutch2:        54
Hex2dwLingoSSE: 160
lingo_htodw:    1950
ax_jj_htodw:    174
krbhtodw:       547
--- ok ---

krbhtodw - the author is Dave (KeepingRealBusy), with minor changes made with his permission - is the most universal proc: it checks the input and can process "ignorant chars". It is lookup-table based.
The fastest GPR code, ax_jj_htodw, is by Jochen (jj2007) - it uses a word-indexed lookup table.

All versions not listed under "Other's versions" are mine, but when posting in this thread I excluded every non-GPR, unrolled and/or lookup-table-based version. Well, new CPUs have been released since then, and maybe it's interesting to test all these procs again :biggrin:

Title: Re: C/C++ vs Assembler
Post by: Antariy on May 06, 2013, 11:46:31 PM

BTW: Jochen's code is 174 bytes long AND its lookup table is initialized once at run time and does not take space in the EXE, so it is not only the fastest but also the smallest of the unrolled versions (the size includes the hex2dw code plus the table initialization code).
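
The idea of a word-indexed lookup table that is built once at run time can be sketched in C roughly as follows. This is not Jochen's ax_jj_htodw; the names pair_table, init_pair_table and hex2dw_lut are hypothetical, and the sketch assumes an input of exactly 8 valid hex digits.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch of a word-indexed lookup table, not ax_jj_htodw.
   Index = two ASCII characters combined into a 16-bit value,
   value = the byte those two hex digits encode.
   A zero-initialized static array typically lives in .bss, so it
   occupies no space in the executable file itself. */
static uint8_t pair_table[65536];
static int pair_table_ready = 0;

static unsigned nibble(unsigned c)
{
    return (c & 0x0F) + ((c >> 6) & 1) * 9;   /* '0'-'9', 'A'-'F', 'a'-'f' */
}

/* Fill the entries for every valid two-digit combination, once. */
static void init_pair_table(void)
{
    static const char digits[] = "0123456789ABCDEFabcdef";
    for (size_t a = 0; a < sizeof digits - 1; ++a) {
        for (size_t b = 0; b < sizeof digits - 1; ++b) {
            unsigned hi = (unsigned char)digits[a];
            unsigned lo = (unsigned char)digits[b];
            /* First character in the low byte, as a little-endian
               16-bit load from the string would see it. */
            pair_table[hi | (lo << 8)] = (uint8_t)((nibble(hi) << 4) | nibble(lo));
        }
    }
    pair_table_ready = 1;
}

/* Assumes s points to exactly 8 valid hex digits; no input checking. */
static uint32_t hex2dw_lut(const char *s)
{
    uint32_t result = 0;
    if (!pair_table_ready)
        init_pair_table();
    for (int i = 0; i < 8; i += 2) {
        unsigned idx = (unsigned char)s[i] | ((unsigned)(unsigned char)s[i + 1] << 8);
        result = (result << 8) | pair_table[idx];
    }
    return result;
}
```

Each table access converts two digits at once, so an 8-digit string needs only four lookups; the price is a 64 KB table that competes for cache with the rest of the program.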

Title: Re: C/C++ vs Assembler
Post by: qWord on May 07, 2013, 12:53:42 AM

OK - I obviously missed the "spirit" of this thread...

Title: Re: C/C++ vs Assembler
Post by: Gunther on May 07, 2013, 12:54:05 AM

Hi Alex,

here are the timings for your 32Alex's_hex2dw.exe:

Quote
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)

26   cycles for Fast version
29   cycles for Fast version under AMD
50   cycles for Small 1
49   cycles for Small 2
51   cycles for Small 3
54   cycles for Small 3.1
51   cycles for Small 4
14   cycles for MMX 1
15   cycles for MMX 2
15   cycles for SSE1

Other's Versions:
54   cycles for Axhex2dw improved by Hutch (1)
62   cycles for Axhex2dw improved by Hutch (2)

10   cycles for Lingo's SSE version
19   cycles for Lingo's BIG integer version
17   cycles for Jochen's WORD-Indexed version
35   cycles for Dave's version (with minor changes)

29   cycles for Fast version
32   cycles for Fast version under AMD
48   cycles for Small 1
52   cycles for Small 2
54   cycles for Small 3
54   cycles for Small 3.1
50   cycles for Small 4
11   cycles for MMX 1
15   cycles for MMX 2
15   cycles for SSE1

Other's Versions:
65   cycles for Axhex2dw improved by Hutch (1)
60   cycles for Axhex2dw improved by Hutch (2)

10   cycles for Lingo's SSE version
19   cycles for Lingo's BIG integer version
17   cycles for Jochen's WORD-Indexed version
34   cycles for Dave's version (with minor changes)

29   cycles for Fast version
31   cycles for Fast version under AMD
48   cycles for Small 1
53   cycles for Small 2
54   cycles for Small 3
54   cycles for Small 3.1
54   cycles for Small 4
14   cycles for MMX 1
15   cycles for MMX 2
15   cycles for SSE1

Other's Versions:
55   cycles for Axhex2dw improved by Hutch (1)
62   cycles for Axhex2dw improved by Hutch (2)

10   cycles for Lingo's SSE version
19   cycles for Lingo's BIG integer version
17   cycles for Jochen's WORD-Indexed version
34   cycles for Dave's version (with minor changes)

==========
Codesizes:
Axhex2dw_Unrolled:   396
Axhex2dw_Unrolled_AMD:   396
Axhex2dw1 - 1:   69
Axhex2dw2 - 2:   48
Axhex2dw3 - 3:   57
Axhex2dw3_1 - 3.1:   56
Axhex2dw3 - 4:   61
Axhex2dw_MMX:   128
Axhex2dw_MMX2:   160
Axhex2dw_SSE:   160
Alex_Short_Hutch:   59
Axhex2dw_Hutch2:   54
Hex2dwLingoSSE:   160
lingo_htodw:   1950
ax_jj_htodw:   174
krbhtodw:   547
--- ok ---

Title: Re: C/C++ vs Assembler
Post by: dedndave on May 07, 2013, 01:16:18 AM

Quote from: qWord on May 07, 2013, 12:53:42 AM
OK - I obviously missed the "spirit" of this thread...

:biggrin: as if the subject has never come up before

(http://rationalwiki.org/w/images/7/75/Deadhorse.gif)

Title: Re: C/C++ vs Assembler
Post by: jj2007 on May 07, 2013, 02:07:53 AM

Quote from: qWord on May 07, 2013, 12:53:42 AM
OK - I obviously missed the "spirit" of this thread...

Hey, your code was actually quite good. Even if your 'piler refuses to optimise for my AMD ;)

Title: Re: C/C++ vs Assembler
Post by: Gunther on May 07, 2013, 03:27:10 AM

Quote from: dedndave on May 07, 2013, 01:16:18 AM
:biggrin: as if the subject has never come up before

(http://rationalwiki.org/w/images/7/75/Deadhorse.gif)

oh yes, it's a very new topic. :lol: :lol: :lol:

Gunther

Title: Re: C/C++ vs Assembler
Post by: jj2007 on May 07, 2013, 03:31:40 AM

Quote from: qWord on May 06, 2013, 09:33:07 PM
Behold and see...

Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
41 cycles for Small 1
38 cycles for Small 2
40 cycles for Small 3
42 cycles for Small 3.1
41 cycles for Small 4
50 cycles for C version
53 cycles for C mod JJ
58 hsz2dw (Microsoft 32-Bit C/C++ optimization compiler v16.00.40219.01)
24 hsz2dw2 (unrolled 4 times)

41 cycles for Small 1
39 cycles for Small 2
41 cycles for Small 3
57 cycles for Small 3.1
23 cycles for Small 4
51 cycles for C version
53 cycles for C mod JJ
31 hsz2dw (Microsoft 32-Bit C/C++ optimization compiler v16.00.40219.01)
24 hsz2dw2 (unrolled 4 times)

43 cycles for Small 1
38 cycles for Small 2
41 cycles for Small 3
42 cycles for Small 3.1
41 cycles for Small 4
50 cycles for C version
51 cycles for C mod JJ
31 hsz2dw (Microsoft 32-Bit C/C++ optimization compiler v16.00.40219.01)
24 hsz2dw2 (unrolled 4 times)

Well, a LUT is difficult to beat ;-)

Title: Re: C/C++ vs Assembler
Post by: habran on May 07, 2013, 05:43:20 AM

Intel(R) Core(TM) i7-3610QM CPU @ 2.30GHz (SSE4)



7       cycles for Fast version
8       cycles for Fast version under AMD
16      cycles for Small 1
18      cycles for Small 2
19      cycles for Small 3
19      cycles for Small 3.1
19      cycles for Small 4
4       cycles for MMX 1
5       cycles for MMX 2
4       cycles for SSE1

Other's Versions:
23      cycles for Axhex2dw improved by Hutch (1)
22      cycles for Axhex2dw improved by Hutch (2)

3       cycles for Lingo's SSE version
7       cycles for Lingo's BIG integer version
21      cycles for Jochen's WORD-Indexed version
12      cycles for Dave's version (with minor changes)


11      cycles for Fast version
14      cycles for Fast version under AMD
17      cycles for Small 1
19      cycles for Small 2
18      cycles for Small 3
19      cycles for Small 3.1
18      cycles for Small 4
5       cycles for MMX 1
4       cycles for MMX 2
4       cycles for SSE1

Other's Versions:
22      cycles for Axhex2dw improved by Hutch (1)
21      cycles for Axhex2dw improved by Hutch (2)

2       cycles for Lingo's SSE version
7       cycles for Lingo's BIG integer version
5       cycles for Jochen's WORD-Indexed version
12      cycles for Dave's version (with minor changes)


10      cycles for Fast version
13      cycles for Fast version under AMD
17      cycles for Small 1
17      cycles for Small 2
18      cycles for Small 3
19      cycles for Small 3.1
19      cycles for Small 4
5       cycles for MMX 1
6       cycles for MMX 2
5       cycles for SSE1

Other's Versions:
23      cycles for Axhex2dw improved by Hutch (1)
22      cycles for Axhex2dw improved by Hutch (2)

4       cycles for Lingo's SSE version
7       cycles for Lingo's BIG integer version
6       cycles for Jochen's WORD-Indexed version
11      cycles for Dave's version (with minor changes)

==========
Codesizes:
Axhex2dw_Unrolled:      396
Axhex2dw_Unrolled_AMD:  396
Axhex2dw1 - 1:  69
Axhex2dw2 - 2:  48
Axhex2dw3 - 3:  57
Axhex2dw3_1 - 3.1:      56
Axhex2dw3 - 4:  61
Axhex2dw_MMX:   128
Axhex2dw_MMX2:  160
Axhex2dw_SSE:   160
Alex_Short_Hutch:       59
Axhex2dw_Hutch2:        54
Hex2dwLingoSSE: 160
lingo_htodw:    1950
ax_jj_htodw:    174
krbhtodw:       547
--- ok ---

Title: Re: C/C++ vs Assembler
Post by: Gunther on May 07, 2013, 06:25:14 AM

Jochen,

Quote from: jj2007 on May 07, 2013, 03:31:40 AM
Well, a LUT is difficult to beat ;-)

that's true, but an old wisdom.

Gunther

Title: Re: C/C++ vs Assembler
Post by: Antariy on May 07, 2013, 08:43:59 AM

Quote from: qWord on May 07, 2013, 12:53:42 AM
OK - I obviously missed the "spirit" of this thread...

No, it's OK, your code is good - fast and very readable, really :t Thanks for posting it :biggrin: It was an informative test, too; now in this strange thread (@Dave - :biggrin:) we have rolled and unrolled C versions (you have provided both) and ASM versions of the code, mixed in crazy testbeds :biggrin:
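
For readers who have not seen the two styles side by side, this is roughly what "unrolled 4 times" means in C. It is a sketch under assumptions, not qWord's hsz2dw2; the names hex_digit and hex2dw_unrolled4 are made up for the example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: unchecked, case-insensitive hex digit value. */
static uint32_t hex_digit(unsigned c)
{
    return (c & 0x0F) + ((c >> 6) & 1) * 9;
}

/* Hypothetical sketch of a loop unrolled 4 times, not hsz2dw2 itself.
   Four digits are converted per iteration, so the loop counter and the
   branch are paid once per four digits instead of once per digit. */
static uint32_t hex2dw_unrolled4(const char *s)
{
    size_t n = strlen(s);
    size_t i = 0;
    uint32_t r = 0;

    for (; i + 4 <= n; i += 4) {
        r = (r << 16)
          | (hex_digit((unsigned char)s[i])     << 12)
          | (hex_digit((unsigned char)s[i + 1]) <<  8)
          | (hex_digit((unsigned char)s[i + 2]) <<  4)
          |  hex_digit((unsigned char)s[i + 3]);
    }
    /* Remaining 0..3 digits are handled by a plain rolled loop. */
    for (; i < n; ++i)
        r = (r << 4) | hex_digit((unsigned char)s[i]);

    return r;
}
```

The rolled tail loop is exactly the "looped" style the small versions in the testbed are restricted to; the unrolled body trades code size for fewer loop iterations.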

Title: Re: C/C++ vs Assembler
Post by: Antariy on May 07, 2013, 09:15:23 AM

Quote from: habran on May 07, 2013, 05:43:20 AM
:t

Thank you, habran :t

If your OS is 32 bit, then it really seems that the best idea is to run 32 bit proggies under a 32 bit OS :biggrin:

Title: Re: C/C++ vs Assembler
Post by: habran on May 07, 2013, 12:19:15 PM

It is 64 bit Win 7 :biggrin:
IMO there is no penalty for running 32 bit on 64 bit, but 64 bit is certainly faster because of 64 bit programming :t

Title: Re: C/C++ vs Assembler
Post by: Antariy on May 07, 2013, 01:02:50 PM

That was my assumption just because your timing results seem to be smaller than other's with the same CPU :biggrin:

Title: Re: C/C++ vs Assembler
Post by: habran on May 07, 2013, 02:18:51 PM

It is Toshiba Qosmio 16 gig ram laptop with 2.3 gig i7 and 64 bit Windows 7 Home with AVX :t
I think that qWord has got the same one
It is a great toy :bgrin:

Title: Re: C/C++ vs Assembler
Post by: Antariy on May 07, 2013, 02:43:12 PM

Quote from: habran on May 07, 2013, 02:18:51 PM
It is a great toy :bgrin:

Sure it is :biggrin:

Title: Re: C/C++ vs Assembler
Post by: Gunther on May 07, 2013, 10:46:22 PM

Hi habran,

Quote from: habran on May 07, 2013, 02:18:51 PM
It is Toshiba Qosmio 16 gig ram laptop with 2.3 gig i7 and 64 bit Windows 7 Home with AVX :t
I think that qWord has got the same one
It is a great toy :bgrin:

it's probably the Ivy Bridge, isn't it?

Gunther

Title: Re: C/C++ vs Assembler
Post by: habran on May 07, 2013, 11:20:17 PM

Yes Gunther, the Ivy Bridge it is :biggrin:

Quote
I already posted this before
here are specifications:

Intel® Core™ i7-3610QM Processor
(6M Cache, up to 3.30 GHz)
Specifications
Essentials
Status Launched
Launch Date Q2'12
Processor Number i7-3610QM
# of Cores 4
# of Threads 8
Clock Speed 2.3 GHz
Max Turbo Frequency 3.3 GHz
Intel® Smart Cache 6 MB
Bus/Core Ratio 23
DMI 5 GT/s
Instruction Set 64-bit
Instruction Set Extensions AVX
Embedded Options Available No
Lithography 22 nm
Max TDP 45 W
Recommended Customer Price TRAY: $378.00

Title: Re: C/C++ vs Assembler
Post by: Gunther on May 08, 2013, 07:20:53 AM

Hi habran,

Quote from: habran on May 07, 2013, 11:20:17 PM
Quote
I already posted this before
here are specifications:

Intel® Core™ i7-3610QM Processor
(6M Cache, up to 3.30 GHz)
Specifications
Essentials
Status Launched
Launch Date Q2'12
Processor Number i7-3610QM
# of Cores 4
# of Threads 8
Clock Speed 2.3 GHz
Max Turbo Frequency 3.3 GHz
Intel® Smart Cache 6 MB
Bus/Core Ratio 23
DMI 5 GT/s
Instruction Set 64-bit
Instruction Set Extensions AVX
Embedded Options Available No
Lithography 22 nm
Max TDP 45 W
Recommended Customer Price TRAY: $378.00

an excellent machine. Runs Windows 64 as the only OS? How did you manage the new "BIOS"?

Gunther

Title: Re: C/C++ vs Assembler
Post by: habran on May 08, 2013, 08:44:43 AM

Thanks Gunther
I did not have to touch anything
I just installed my tools and copied my projects :biggrin:

Title: Re: C/C++ vs Assembler
Post by: Gunther on May 09, 2013, 02:42:42 AM

Hi habran,

Quote from: habran on May 08, 2013, 08:44:43 AM
I did not have to touch anything
I just installed my tools and copied my projects :biggrin:

so, do you have an EFI drive, too? Or is your hard disk not over 2.2 TB in size?

Gunther

Title: Re: C/C++ vs Assembler
Post by: habran on May 09, 2013, 05:58:31 AM

Gunther, I have C drive of 685 GB and D drive 931 GB
for my purpose it is more than enough :t

Title: Re: C/C++ vs Assembler
Post by: Gunther on May 09, 2013, 06:54:36 AM

habran,

Quote from: habran on May 09, 2013, 05:58:31 AM
Gunther, I have C drive of 685 GB and D drive 931 GB
for my purpose it is more than enough :t

Okay, you can work with the original BIOS. My disk is over 2 TB and I had to deal with EFI. Installing different operating systems isn't pure joy.

But we should no longer discuss our hardware equipment, because the thread has another goal.

Gunther

Title: Re: C/C++ vs Assembler
Post by: habran on May 09, 2013, 07:01:57 AM

If I were you I would change it to two or three smaller drives :bgrin:

Title: Re: C/C++ vs Assembler
Post by: Gunther on May 09, 2013, 07:21:05 AM

Hi habran,

Quote from: habran on May 09, 2013, 07:01:57 AM
If I were you I would change it to two or three smaller drives :bgrin:

good proposal, and that is exactly what I've done. That is where the trouble started. Here is a source for further reading. (https://en.wikipedia.org/wiki/Unified_Extensible_Firmware_Interface)

Gunther

Title: Re: C/C++ vs Assembler
Post by: habran on May 09, 2013, 11:26:14 AM

what trouble? :icon_eek:
I've read about UEFI and it pisses me off, WTF is that :icon13:
I don't understand why we would need that crap :dazzled:

your machine is 3.4 gig and it is supposed to be lightning fast
you pay big money to get the best thing and then you get some crap :(
that is not tolerable

I stopped buying desktops, I find laptops more suitable for everything
and if I need to go somewhere I can take it with me easy together with my mobile
internet connection

I hate sitting at the desk
with the laptop I can enjoy the comfort of a recliner
I put a board over the chair and the laptop on it and a cup of a long black ;)

Title: Re: C/C++ vs Assembler
Post by: Gunther on May 09, 2013, 07:55:40 PM

Hi habran,

Quote from: habran on May 09, 2013, 11:26:14 AM
what trouble? :icon_eek:
I've read about UEFI and it pisses me off, WTF is that :icon13:
I don't understand why we would need that crap :dazzled:

The point is: if you have a machine (no matter whether desktop or laptop) with a hard disk over 2.2 TB, you can't manage it with the old master boot record. In that case you need UEFI with GPT. It has advantages: you no longer need logical drives, because every partition is primary. GPT leaves a dummy (protective) MBR on your disk. But there are disadvantages, too. For example, Windows XP and Windows 7 (32 bit) are not EFI aware (very simply: no appropriate drivers), so you can't install these systems in parallel with, say, Windows 8. Moreover, if you would like to run Windows 8 and Linux in parallel, you'll need an EFI-aware boot manager. It took me one week to clear up these questions. I now have Windows 7 and Linux installed (both 64 bit) and the 32 bit versions as virtual machines. My boot manager is GRUB 2.
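
The 2.2 TB figure follows from the MBR partition table, which stores sector addresses and sector counts as 32-bit values. With the common 512-byte sector that gives 2^32 * 512 bytes. A quick check in C (the function name mbr_max_bytes is made up for this illustration):

```c
#include <stdint.h>

/* Largest capacity addressable with 32-bit sector numbers, as used in
   MBR partition entries. sector_size is in bytes (commonly 512).
   Hypothetical helper for illustration only. */
static uint64_t mbr_max_bytes(uint32_t sector_size)
{
    return ((uint64_t)1 << 32) * sector_size;
}
```

mbr_max_bytes(512) is 2,199,023,255,552 bytes, i.e. 2 TiB or about 2.2 TB in decimal units. GPT uses 64-bit sector addresses instead, which is why it does not hit this limit.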

Gunther

Title: Re: C/C++ vs Assembler
Post by: habran on May 09, 2013, 08:11:40 PM

Windows 8 sucks :(
I am happy with Windows 7, 64 bit
If I want to buy a new machine I will probably wait for Windows 9 ;)
thanks for the clarification Gunther :t

Title: Re: C/C++ vs Assembler
Post by: dedndave on May 09, 2013, 10:21:40 PM

i did manage to find a GPT driver from Paragon that will work under XP :P

the question i have is....
when it comes to hard drives, how big is too big ?
personally, i think if you exceed 2 Tb, then you have too many eggs in one basket, anyways - lol
let's face it - when the drive crashes, how much stuff do you want to lose

better to have (3) 1 Tb drives that work and 1 that doesn't
than to have a 4 Tb drive that doesn't

Title: Re: C/C++ vs Assembler
Post by: habran on May 09, 2013, 10:34:05 PM

Quote from: dedndave on May 09, 2013, 10:21:40 PM
personally, i think if you exceed 2 Tb, then you have too many eggs in one basket, anyways - lol

I agree with you totally :t

Title: Re: C/C++ vs Assembler
Post by: Vortex on May 10, 2013, 03:23:46 AM

Quote from: dedndave on May 09, 2013, 10:21:40 PM
better to have (3) 1 Tb drives that work and 1 that doesn't
than to have a 4 Tb drive that doesn't

I agree with you :t

Title: Re: C/C++ vs Assembler
Post by: Gunther on May 10, 2013, 03:54:29 AM

Hi Erol,

Quote from: Vortex on May 10, 2013, 03:23:46 AM
Quote from: dedndave on May 09, 2013, 10:21:40 PM
better to have (3) 1 Tb drives that work and 1 that doesn't
than to have a 4 Tb drive that doesn't

I agree with you :t

the point is: the 4 TB hard drive will work with UEFI and GPT. Unfortunately, that's the future because Intel, AMD, Microsoft, Hewlett-Packard and other big players are the "fans" of that idea.

Gunther

Title: Re: C/C++ vs Assembler
Post by: habran on May 10, 2013, 05:52:49 AM

*&^%$#@!~ :(

Title: Re: C/C++ vs Assembler
Post by: Vortex on May 10, 2013, 06:02:07 AM

Hi Gunther,

Dave's approach is very logical. A high capacity hard drive can be a big risk. Splitting the data across multiple drives is safer.

Title: Re: C/C++ vs Assembler
Post by: Gunther on May 10, 2013, 08:20:10 AM

Hi Erol,

Quote from: Vortex on May 10, 2013, 06:02:07 AM
Dave's approach is very logical. A high capacity hard drive can be a big risk. Splitting the data across multiple drives is safer.

no doubt about it. But will it be possible in the future to buy smaller hard disks? My desktop PC has a 2.2 TB hard disk. I needed it, because I want to learn about the new Ivy Bridge architecture. So, what now?

Gunther

Title: Re: C/C++ vs Assembler
Post by: hutch-- on May 10, 2013, 09:42:03 AM

I have had the solution for years, multi-partition machines with 4 hard disks. My now old Core2 quad has 2 x 1 tb drives and 2 x 2 tb drives split into 12 partitions, the first 2 have 259 gig partitions and the last 2 have 500 gig partitions. As a safety margin I keep another XP machine that will read any of the disks from the quad if it ever goes bang and i can also read the disks on one of the old Win2000 machines.

If the i7 64 bit box ever goes bang I am in trouble as they have a different disk format that a 32 bit OS cannot read.

The MASM Forum

General => The Laboratory => Topic started by: Manos on May 04, 2013, 04:11:50 AM