The MASM Forum
General => The Soap Box => Topic started by: aw27 on April 19, 2017, 03:42:37 PM
-
Yesterday, I published an article at Code Project. "Need for Speed - C++ versus Assembly Language" (https://www.codeproject.com/Articles/1182676/Need-for-Speed-Cplusplus-versus-Assembly-Language)
-
Hi Aw.
Off topic: I'm updating your matrix-transposing code to optimize it. The current version is a bit slow, due to the heavy use of push/pop and add instructions. I'm trying to rewrite it to run at the proper speed.
So far I've managed to double the speed, removing the loops and the addition operations on edi. I'm currently trying to do the same for esi, and later I'll review the remaining loops.
On my PC the old speed was around 438 ns with the original version. Now it is only around 220, and I believe it can reach around 130 if I succeed in doing the same for esi.
Once I finish I'll post it for you.
-
My guys will avenge this heresy!! C++ cannot beat a really optimized asm.
-
Off topic: I'm updating your matrix-transposing code to optimize it.
It is an excellent exercise. :t
And there is also room to optimize the Fast Matrix Flip.
-
LordAdef is right, no compiler can beat 100% well optimized assembly code.
Is this a challenge ?
It's the same story over and over again on the c++ forums.
- you must be mad programming in asm.
- asm is ancient and dead.
- nobody uses it anymore.
- why use it if a compiler does a better job.
- etc.
Some of those guys are really sneaky and provoke you to write a faster routine for their own use. :biggrin:
-
LordAdef is right, no compiler can beat 100% well optimized assembly code.
I used to hear the same thing about chess. No computer would ever be able to beat a chess grandmaster because computers have no ideas, no creativity, no positional sense, etc.
Now, every grandmaster is beaten by his smartphone running a free chess app with ELO 3000.
Is this a challenge ?
Sure, I know you are good at optimization, give it a go!
-
I used to hear the same thing about chess. No computer would ever be able to beat a chess grandmaster because computers have no ideas, no creativity, no positional sense, etc.
I understand the logic, but the cases are a bit different: Chess computers win with brute force. Now I wouldn't exclude that one day compilers test various options for an innermost loop to find the fastest encoding, but I wouldn't bet on it 8)
-
That's comparing apples with oranges.
A chess player has to respond to the actions of his opponent. ( human or computer )
A programmer already knows how the CPU will respond.
Sorry, no time to take the challenge.
The first thing i would do, align the code loops and align the data memory and move the memory allocations outside the loops.
Or better allocate the needed amount of memory once.
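The allocation point above can be sketched in C (buffer size, iteration count and the `process` worker are hypothetical, just to illustrate the hoisting):

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical worker: fills the buffer and returns a checksum. */
double process(double *buf, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        buf[i] = (double)i;
        sum += buf[i];
    }
    return sum;
}

/* Slow variant: allocates and frees inside the hot loop. */
double run_naive(size_t n, int iterations)
{
    double sum = 0.0;
    for (int i = 0; i < iterations; i++) {
        double *m = malloc(n * sizeof *m);
        if (!m)
            return 0.0;
        sum = process(m, n);
        free(m);
    }
    return sum;
}

/* Faster variant: one allocation, reused every iteration. */
double run_hoisted(size_t n, int iterations)
{
    double *m = malloc(n * sizeof *m);
    double sum = 0.0;
    if (!m)
        return 0.0;
    for (int i = 0; i < iterations; i++)
        sum = process(m, n);
    free(m);
    return sum;
}
```

Both variants compute the same result; only the allocator traffic differs, which is exactly what you want outside the timed region of a benchmark.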
-
hello sir aw27, I read your article.
Processors can beat humans by brute force, with sequential logic, but when we start inserting parallel logic we can beat machines.
Remove a chess program's opening database and you'll see how easy it is to beat any chess program with a high ELO rating. Chess programs use databases created by humans, from games between human IMs or GMs. There are more possible chess games than stars in our universe, so how much time would a chess program need just to make the first five moves without a database, using only the rules of chess as support?
I'm with the opinions of the people here: C++ code can't beat an experienced assembly programmer. If we consider that a compiler was made by many people, going from a mathematical point of view down to opcodes, let's do the same: let's join experienced assembly programmers to work as a team, and then I know what the final answer will be. I'm saying this because we have much more freedom than a C or C++ programmer.
-
Chess computers win with brute force.
The problem is that they don't win anymore by brute force alone. It would be impossible because there are trillions of possibilities.
-
Remove a chess program's opening database and you'll see how easy it is to beat any chess program with a high ELO rating.
Grandmasters also have encyclopedic knowledge of chess openings. The middlegame is where computers take the lead, not to mention the endgame.
-
Yesterday, I published an article at Code Project. "Need for Speed - C++ versus Assembly Language" (https://www.codeproject.com/Articles/1182676/Need-for-Speed-Cplusplus-versus-Assembly-Language)
Nice work, you put a lot of effort into it :t
Typo in point 7): In order words,... you meant in other words, I suppose.
I undertook a few optimization steps with the ASM source code and was able to improve the assembled performance by more than 30%.
...
I have seen the ASM listing produced by the C++ compiler and some parts are just mind-blowing - nobody would believe a human would code that way (and if he did, the code would be almost impossible to maintain). The compiler uses every trick under the table in an automated way - difficult to beat. It knows everything about how the pipelines and branch prediction work, it reorders instructions, does loop tiling and uses cache-oblivious algorithms.
Of course, you have a point there. On the other hand, if you take the compiler's assembler listing and play around with it, you may tickle out a few % more... and prove again that assembler is faster ;-)
Just joking. Your article clearly shows that a C++ compiler can beat assembler. However, truth is also that in the Lab we have beaten the CRT many times, typically by a factor 2-3; and one would assume that the CRT developers use the best compilers, no?
-
Typo in point 7): In order words,... you meant in other words, I suppose.
Thanks. I have a few more to fix.
if you take the compiler's assembler listing and play around with it, you may tickle out a few % more... and prove again that assembler is faster ;-)
In a real case I would feel tempted to compile the compiler's assembler listing :biggrin:
CRT many times, typically by a factor 2-3; and one would assume that the CRT developers use the best compilers, no?
The initialization code looks at lots of things we typically don't.
-
The article is really well written, congratulations.
But the asm code can be further optimized. Marinus mentioned the memory allocation, but there are those PUSH-POPs, and there is the shl, which is slow. If I'm not mistaken, lea is slower too and could be exchanged for mov - offset (can that be done in 64-bit?).
What I'm saying is that a guy like Marinus, Johen and others could make this asm run faster than the C++ one. In that case, the point of the article loses ground.
Another thing to consider, if one refers to the article's title: when you really do the C++ thing (OOP, those crazy C++ abstractions, etc.), C++ gets increasingly slower. It's slower than pure C.
My only point here is that you are comparing fully optimized C++ code against not fully optimized asm.
-
This much I have seen with comparisons of this type: once the optimisations on both sources are done, if the code is competent in both and the same instructions are used, the difference is negligible. Where the real action will be is in designing better algorithms and parallel multi-threaded applications, as SSE and AVX instructions are not particularly responsive to minor twiddling like the older traditional integer instructions. You do the obvious things: don't repeatedly run code that could safely live outside intensive loops, align all of the data to its required optimums, and you can have a fiddle with code alignment if you think it can make things a bit faster. But the action here will always be better design, not close-range twiddling.
-
deleted
-
You should add a test for ML.exe ;)
-
deleted
-
Yep, this is what I meant :biggrin:
RichMasm assembles with asmc in less than 600 ms on my i5, as compared to 680 for JWasm and 1230 for ML :t
(Wow, that thread was long ::))
-
I have a vested interest in this topic that continually seems to pop up from time to time (I wonder why?)
From the scholarly literature I've been able to accumulate, the following conclusions have been drawn:
1) Handwritten assembly - properly done (the REAL point) - will ALWAYS outperform the compiler (ANY compiler) for the same algorithm
2) Loop structures (unrolling) are a critical factor that need to target hardware L1 cache hierarchies - compilers are really bad at this
3) Data locality cannot be easily predicted by compiler heuristic transforms - and thus will always produce sub-optimal code
4) Compilers are really BAD at generating SIMD/AVX code that takes full advantage of the instruction set
- in one large study of compilers, published in 2013, it was found that on average compilers miss 60% of the opportunities to vectorize
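A classic, hedged illustration of point 4): without `restrict`, the compiler has to assume the arrays may overlap, so many compilers emit scalar code or a runtime overlap check for a loop like this; with `restrict` it vectorizes trivially. (The function name is mine, not from the study.)

```c
#include <stddef.h>

/* Without 'restrict' the compiler must assume x and y may alias,
   which blocks or complicates auto-vectorization.  With 'restrict'
   the loop is a textbook SIMD candidate. */
void saxpy(size_t n, float alpha,
           const float *restrict x, float *restrict y)
{
    for (size_t i = 0; i < n; i++)
        y[i] += alpha * x[i];
}
```

On GCC you can compare the vectorizer's decisions with and without `restrict` via `-O2 -fopt-info-vec`.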
-
but there are those PUSH-POPs, there is the shl which is slow.
You appear to know a lot about these things.
If I'm not mistaken lea is slower too and could be exchanged to mov - offset (can that be done in 64?).
Have you never heard that lea is a handy, fast arithmetic calculator? I am using it like that, not to load an effective memory address.
What I'm saying is that a guy like Marinus, Johen and others could make this asm run faster than the C++ one. In that case, the point of the article loses ground.
I am also expecting that either of them, or anyone else, will restore my faith in a fair world. Compiler-produced spaghetti code should not perform better than well-written hand-made ASM.
-
... will restore my faith in a fair world. Compiler-produced spaghetti code should not perform better than well-written hand-made ASM.
Your article clearly shows that a C++ compiler can beat assembler.
Your article is good, really. And it shows that a compiler can beat us. It does not prove, though, that it will beat us all the time. How easy would it be to write an article that, on the basis of one particular case, "proves" that the compiler can be beaten? Let's not be too superficial...
P.S.:
Continuing the saga on Matrix Transposing...
This is a solution for transposing matrices of any size, square or not. It also supports small matrices, i.e. matrices with fewer than 4 rows or 4 columns.
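For reference, the general case being described (not the poster's asm, just a plain C sketch of the same operation) is this: an M×N transpose that works for any shape, small matrices included.

```c
#include <stddef.h>

/* Transpose the rows x cols matrix 'src' into 'dst'.
   'dst' must hold cols x rows elements.  Works for any shape,
   including matrices with fewer than 4 rows or 4 columns. */
void transpose(size_t rows, size_t cols,
               const float *src, float *dst)
{
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 0; c < cols; c++)
            dst[c * rows + r] = src[r * cols + c];
}
```

An optimized version would process 4x4 blocks with SSE shuffles and fall back to this scalar loop for the edges; the scalar form is the correctness reference to benchmark against.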
Which compiler produced that ultra-fast assembler code you are showing there...?
-
Hello!
You appear to know a lot about these things.
Not really. I'm fairly new to asm myself, but I've been doing really intensive training, learning from whatever source I can. As I'm currently writing a program in asm myself, I happened to benchmark shl and can confirm it's a rather slow option if you are aiming for speed.
You never heard that lea is an handy fast arithmetic calculator? I am using it like that, not to load an effective memory address.
I do! But I confess I only glanced over the lea instructions. Sorry. Nevertheless, it's a place to check the clocks and maybe see if it's the fastest option.
I'm very curious to see how you and these guys will optimize this algo, and how the asm fares afterwards.
There is also a suggestion from Hutch that I read some time ago which one should take into consideration: build the asm side in a dedicated IDE, not in VS.
-
Which compiler produced that ultra-fast assembler code you are showing there...?
You won't like the answer.. :badgrin:
-
How easy would it be to write an article that, on the basis of one particular case, "proves" that the compiler can be beaten? Let's not be too superficial...
If you have a good case it will be easy. This is not a superficial answer.
Which compiler produced that ultra-fast assembler code you are showing there...?
Did I mention it was ultrafast? But you can always improve and show the outcome.
-
I happened to benchmark shl and can confirm it's a rather slow option, if you are aiming for speed.
Use "mul" instead, and show the benchmarks.
-
Which compiler produced that ultra-fast assembler code you are showing there...?
You won't like the answer.. :badgrin:
Probably, I will start disregarding your provocations.
-
Did I mention it was ultrafast? But you can always improve and show the outcome.
The point is that, as far as I know, you hand-coded it in assembler ;)
I happened to benchmark shl and can confirm it's a rather slow option, if you are aiming for speed.
Use "mul" instead, and show the benchmarks.
http://masm32.com/board/index.php?topic=6092.msg64629#msg64629
-
Did I mention it was ultrafast? But you can always improve and show the outcome.
The point is that, as far as I know, you hand-coded it in assembler. Right?
Ah, "assembler is fun" as you say under your logo, and I never thought about doing it in C/C++. I don't know whether it would be faster or not (probably not in this case).
-
Which compiler produced that ultra-fast assembler code you are showing there...?
You won't like the answer.. :badgrin:
Probably, I will start disregarding your provocations.
It's not a provocation aw27, it's me actually teasing JJ!!! He is Microsoft's enemy number 1. I never did anything to you... Why is that?
-
There is an obvious elephant in the room here that everybody is desperately trying to avoid and that is processor family differences. Just as an example, the instruction LEA was genuinely fast on a PIII but turned into a lemon on a PIV. The next generation Core2 series hardware was a little kinder to LEA and the range of i7 hardware does not have a problem with it. SSE got a lot faster with the Core2 and later series at the expense of simpler integer instructions getting slower as silicon was being pointed at later instructions.
Then for each family of processor you have a range of price driven variants that vary with their power design, cache size and frequency rate and collectively all of these many variants make the notion of a single piece of code being faster than another nonsense. The best you will get on similar age processors is a decent average across similar hardware and even that is dodgy.
The vast majority are messy in data and caching and much of the speed related advantages of one algo over another are wasted when the rest of the app is necessarily messy in how it works. It is not to say that its not worth the effort to manually optimise code but it is nowhere as simple as it is being made out here.
The 6-core Haswell I work with at the moment is nominally a 3.3 GHz processor, but check the task manager and most of the time it sleeps in noddy mode at about 1.2 GHz; only when it is loaded does it come up to speed. I have just recently tweaked my old i7 860, which was an overclocker's toy some years ago, by adding memory (8 up to 16 GB), then upped the clock from 2.8 to 3.5 GHz, and it benchmarks faster than the Haswell.
-
I never thought about doing it in C/C++. I don't know whether it would be faster or not in this case
Well... can you really resist trying that one with MSVC?
;)
-
I never thought about doing it in C/C++. I don't know whether it would be faster or not in this case
Well... can you really resist trying that one with MSVC?
;)
MSVC is faster for small matrices. It tends to get slower and slower as the size grows.
But the ASM is not yet optimized.
-
I have tried to build your code... first, Transpose.vcxproj launched VS Community, which took OVER THREE MINUTES to open. CRAP. Then it asked me to login to my M$ account (which I refuse to have) because the trial period is over. Redmond, this CRAPWARE was supposed to be FREE, right?
So I tried my commandline setup for MSVC: "c:\program files (x86)\microsoft visual studio 10.0\vc\include\codeanalysis\sourceannotations.h(194): error C2059: syntax error: '['"
[RANT]
Great. And it seems that all my previously working sources show the same error. Thank you, VS Community, for introducing new "features".
Sorry, I give up on C/C++. Almost every time I put my hands on C code, it ends up with endless searches on the web for somebody who solved the mystery of the missing header files, or (in this case) header files that have "syntax errors" although I definitely never touched them. Not to mention the numerous attempts to load M$ "projects", which fail miserably because the current MSVC is not able to read the old obsolete crap that was saved in the previous version two years ago. Visual Crap just stinks. Kudos to Hutch - Masm32 works. Same to Pelle Orinius, btw - his C compiler works, too.
This afternoon I wasted over one hour trying to connect a phone to my PC with Bluetooth. Incredibly complicated, Windows help completely useless, it just sends you in circles, I gave up in the end. How could this "OS" survive so many years???
A helicopter was flying around above Seattle when an electrical malfunction disabled all of the aircraft's electronic navigation and communications equipment. Due to the clouds and haze, the pilot could not determine the helicopter's position and course to fly to the airport. The pilot saw a tall building, flew toward it, circled, drew a handwritten sign, and held it in the helicopter's window. The pilot's sign said "WHERE AM I?" in large letters. People in the tall building quickly responded to the aircraft, drew a large sign, and held it in a building window. Their sign read "YOU ARE IN A HELICOPTER." The pilot smiled, waved, looked at his map, determined the course to steer to SEATAC airport, and landed safely. After they were on the ground, the co-pilot asked the pilot how the "YOU ARE IN A HELICOPTER" sign helped determine their position. The pilot responded "I knew that had to be the Microsoft building because they gave me a technically correct, but completely useless answer."
[/RANT]
-
MS C compiler is good, standard headers are just c...
For that reason I use the WDDK headers, or just do without them.
C is so flexible :t
EDIT: reminder
cl.exe -GS- -Zl -fp:fast -arch:SSE2 -d2noftol3 -O2 N4S.c DeterminantC.c -link -nocoffgrpinfo
-
MS C compiler is good, standard headers are just c...
I didn't ask MS C to include sourceannotations.h :(
-
I have tried to build your code... first, Transpose.vcxproj launched VS Community, which took OVER THREE MINUTES to open. CRAP. Then it asked me to login to my M$ account (which I refuse to have) because the trial period is over. Redmond, this CRAPWARE was supposed to be FREE, right?
So I tried my commandline setup for MSVC: "c:\program files (x86)\microsoft visual studio 10.0\vc\include\codeanalysis\sourceannotations.h(194): error C2059: syntax error: '['"
Don't forget that you can always delete it and have some peace of mind in the future.
-
I confess I am no fan of the musical chairs that Microsoft play with their C/C++ versions. I have the source code from the SAPI 5 SDK for the app that runs the speech engine, and I also have a perfect copy of the VC2003 environment that built everything from the old SDK onwards. But when I tried to build the TTS app it needed some AFX crap, so I thought PHUKIT, tweaked the original executable in ResourceHacker, put a manifest into it so it looks like a modern app, and redrew the dialog interface so that it was more or less useful; it is now worth using.
C was supposed to be portable, something that Microsoft have deliberately broken to keep the suckers dependent.
At least with Japheth's JWASM if you paddled through his makefile you could get the options and build it with a batch file but it WAS written to be portable. Built it in Pelle's C compiler as well.
-
At least with Japheth's JWASM if you paddled through his makefile you could get the options and build it with a batch file but it WAS written to be portable. Built it in Pelle's C compiler as well.
For JWASM, I did not even look at the makefile; in the editor I selected all .C and .H files and it compiled fine with VS 2015, both for 32- and 64-bit.
-
Japheth used to recommend the VC2003 toolkit as it had better libraries than the later versions. I got it to build in VC10, VC2003 and Pelle's C but never with an IDE from Microsoft, always built with batch files. The VC2003 versions were smaller and faster.
-
Japheth used to recommend the VC2003 toolkit as it had better libraries than the later versions. I got it to build in VC10, VC2003 and Pelle's C but never with an IDE from Microsoft, always built with batch files. The VC2003 versions were smaller and faster.
Anything older than VS 2005 has no interest for me now; support for 64-bit started there. In case of need I have DDKs as old as Windows NT 4, or SDKs that are not distributed anymore.
-
This worries me much less with tools and utilities than it would with specific user apps; in most cases non-UI tool applications that are primarily single-threaded don't go any faster in 64-bit than in 32-bit. It's where large memory is an advantage that 64-bit shines, when you can routinely allocate multi-gigabyte blocks and multi-thread the processing of large amounts of memory.
-
test programs compiled with version 19.10 for x86 and x64
and x86 test programs compiled with versions 19.00 and 19.10
-
don't go any faster in 64 bit than 32 bit.
When the larger number of 64-bit CPU registers helps reduce data movement in and out of memory, it will help a lot.
-
Like the x64 fastcall calling convention: many algorithms use fewer than 3 args in 32-bit and can be run as FASTCALL. You may be assuming that everything is done in 256-bit operations, but many tasks are unaligned messes with mixed byte, word and dword data that defy the later, larger and faster registers. Then you have the code-size difference: smaller 32-bit code uses less cache than equivalent 64-bit code, and smaller data sizes load faster than big ones. The simple fact is that some code is faster in 32-bit, some other code is faster in 64-bit; anyone who is familiar with writing 32-bit assembler code already knows how to be efficient with the usable registers, the differences between inner- and outer-loop code, and instruction choice.
-
Like x64 fastcall calling convention, many algorithms use less that 3 args on 32 bit and can be run as FASTCALL.
Except for very small functions, FASTCALL will end up making the code slower.
The reason is that you will have to save the register contents somewhere inside the function.
Before the call you have to load the registers with data, and inside the function you will have to save the register contents somewhere because you need the registers for other things. A waste of cycles; it's like putting the car keys in your pocket to cross the room and then placing them on another table.
The same applies to x64; although it has more registers to play with, it is not called FASTCALL anymore - there is no other. Ah yes, Vectorcall, but it has the same problem.
-
This is indeed an unusual comment. If you don't use register passing à la the Microsoft Application Binary Interface ("rcx rdx r8 r9"), you are left with passing data via globals or the old, slower STDCALL-style stack passing with pushes and pops. Now, of course, nothing is stopping you from pre-loading a number of AVX registers and calling a procedure that will use them, but you must get the arguments to a procedure somehow; it does not happen by magic. In 32-bit you used the Intel Application Binary Interface, which was a standard PUSH/CALL technique, and you could emulate FASTCALL with up to 3 registers to keep the stack overhead down for a short leaf procedure.
-
This is indeed an unusual comment,
Not unusual.
I made a quick search on google and there was someone with the same idea:
"You don't gain anything by passing in registers if the called function immediately needs to spill everything out into memory for its own calculations."
Another one:
"How fast is this calling convention, comparing to __cdecl and __stdcall? Find out for yourselves. Set the compiler option /Gr, and compare the execution time. I didn't find __fastcall to be any faster than other calling conventons, but you may come to different conclusions."
old slower style STDCALL stack passing with pushes and pops.
STDCALL is not slow anymore; it is as fast as CDECL and, in my opinion, in real life (not school-class examples) both are faster than FASTCALL. Sounds weird, but this is the reason it is not widespread.
-
This tends to be why assembler programmers benchmark techniques rather than search the internet for quotations.
FASTCALL in 64 bit is specification.
mov rcx, handle
mov rdx, wmsg
mov r8, wparam
mov r9, lparam
call SendMessage
mov retval, rax
In 32 bit STDCALL is specification.
push lparam
push wparam
push wmsg
push handle
call SendMessage
mov retval, eax
It's a simple fact that registers are a lot faster than memory, and much of the design of 64-bit FASTCALL was to reduce the call overhead for the vast majority of function calls, which use 4 or fewer arguments. When you don't need to twiddle the stack you reduce overhead and pick up speed. The other factor of course is "does it matter" when you are calling high-level code in either libraries or DLL system functions.
Being able to save a few picoseconds on a MessageBox() call seems to be the achievement of much modern high-level code design, whereas the benchmarking approach puts the effort where it matters: in high-level code you pursue clarity and maintainability, while in low-level code you design and benchmark to get the speed up. You need to do more than just twiddle compiler options; a disassembler does not tell lies.
-
deleted
-
This tends to be why you have a variety of techniques, stack frames for high level code that uses many arguments and local variables and no stack frame for low argument counts and direct register passing to cut overhead and improve speed.
-
deleted
-
A helicopter was flying around above Seattle when an electrical malfunction disabled all of the aircraft's electronic navigation and communications equipment. [...] "I knew that had to be the Microsoft building because they gave me a technically correct, but completely useless answer."
HAHAHA! This is really funny :lol: :biggrin:
-
I updated the article "Need for Speed - C++ versus Assembly Language" (https://www.codeproject.com/Articles/1182676/Need-for-Speed-Cplusplus-versus-Assembly-Language); now it includes C# and Free Pascal in the race as well. Now ASM wins hands down in both cases, particularly against C#. :greenclp:
-
:greenclp: :greenclp: :greenclp: :greenclp:
You have won CodeProject
Best C++ Article of April 2017 First Prize.
Type: Article
Location: Need for Speed - C++ versus Assembly Language
https://www.codeproject.com/Articles/1182676/Need-for-Speed-Cplusplus-versus-Assembly-Language
CodeProject Mug (http://www.cafepress.co.uk/codeproject.1302986) from CodeProject. Value: $14
-
Congrats, José :t
That is a big success, and worth much more than the mug ;)
-
PPS. Looking (after you conveniently ignored my initial post - sheesh, science must hurt, or else you're preaching to the converted) at real (therefore reproducible) science = metrics:
It has been established that compilers cannot exceed the standard asm coder. Please rectify your popularist Code Project award-winning post.
-
Gonna need a pic of you standing majestically with the codeproject mug and staring off into the distance.
-
That is a big success, and worth much more than the mug ;)
I had no idea it was worth anything. :icon14:
-
Gonna need a pic of you standing majestically with the codeproject mug and staring off into the distance.
And with a crown on my head, of course.
-
Please rectify your popularist code project award winning post.
Yes, Sir! :icon_redface:
-
I'm sure that if you get rid of the high-level constructs (.if .. .else .. etc.), you'll see a vast improvement in speed ;)
As the C++ compiler uses every trick in the book.. try the same for asm ;)
-
I'm sure that if you get rid of the high-level constructs (.if .. .else .. etc.), you'll see a vast improvement in speed ;)
I've never seen such an improvement. Can you post an example?
-
I haven't got an example offhand, but Dedndave (or someone else) posted something on this topic a few years back.
Their example improved on the 'goto' by one instruction per IF, IIRC - so this was my thought, as AW27's asm example had a few levels of IFs.
A couple of million extra instructions could make a difference in those totals.
-
If you are serious about such things, you use a table of labels and reach every option with a couple of instructions.
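The table-of-labels idea sketched in C (hypothetical opcodes and handlers): a dense switch, or an explicit table of function pointers, replaces a chain of compares and branches with one indexed load and one indirect jump.

```c
#include <assert.h>

/* Hypothetical handlers for opcodes 0..3. */
static int op_add(int a, int b) { return a + b; }
static int op_sub(int a, int b) { return a - b; }
static int op_mul(int a, int b) { return a * b; }
static int op_nop(int a, int b) { (void)b; return a; }

/* The jump table: any option is reached with one indexed
   load and one indirect call, regardless of how many
   options there are. */
static int (*const dispatch[4])(int, int) = {
    op_add, op_sub, op_mul, op_nop
};

int eval(unsigned op, int a, int b)
{
    assert(op < 4);          /* caller must pass a valid opcode */
    return dispatch[op](a, b);
}
```

Compilers do the same transformation automatically for a dense `switch`; the asm equivalent is a table of labels indexed by the selector.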
-
I am always curious to see highly optimised code; therefore, some time ago I created the CodeSize macro. The attached testbed is purest Masm32 code, no MasmBasic, promised. All you have to do is write code that is more efficient than the built-in HLL stuff, and add a label before and after. Example:
Man_Repeat_s:
@@:
dec ecx
jne @B
Man_Repeat_endp:
CodeSize Man_Repeat
Output:
3 bytes for Man_Repeat
Really, extremely easy to use. And it's fun to outperform the HLL stuff :t
-
I'm sure that if you get rid of the high-level constructs (.if .. .else .. etc.), you'll see a vast improvement in speed ;)
As the C++ compiler uses every trick in the book.. try the same for Asm ;)
The variation I assembled with ML64 does not contain high-level constructs, and it shows no noticeable performance difference. But I agree that the high-level constructs contain more instructions.
-
JJ: Why not
mov ecx, 9
Loop_s:
loop Loop_s
Loop_endp:
Epa! (forget the question)
146665 cycles for Loop
98486 cycles for jne
-
But I agree that the high level constructs contain more instructions.
For example?
JJ: Why not mov ecx, 9
Loop_s:
loop Loop_s
Loop_endp:
Epa! (forget the question)
146665 cycles for Loop
98486 cycles for jne
Valid example, thanks, it's indeed one byte shorter ;-)
I've added your suggestion to testbed version 2. Including the HLL equivalent :bgrin:
-
I always laugh at some of the notions of speed: would it really matter if your MessageBoxA was a few picoseconds faster than someone else's? You put the effort where it does matter - where you have processing bottlenecks, where time-critical code is holding up the works; the rest is: write it clearly, make it maintainable and reliable. Then there is the choice of algorithm: a sloppy quicksort will outperform a brilliantly optimised bubble sort, so don't waste your effort on the wrong idea.
-
I absolutely agree, Hutch. 100% 8)
The last question was, however, whether HLL produces longer or slower code. And I am still waiting to see an example that confirms this statement. One example would be enough.
-
I always laugh at some of the notions of speed: would it really matter if your MessageBoxA was a few picoseconds faster than someone else's?
Although I agree - I have another reason for doing full ASM apps. Wait for it.... TA DAAAAA = Size!
Yes, size does matter - especially when you want to create cross hardware/software SKU platform support, installs, network comms, etc.
It also does not hurt that smaller generally (yes, I know that's relative) means faster:
smaller in-memory footprint, better cache-hit probability, smaller HDD binary size reducing load times, etc....
[EDIT: Note to self - do not use "etc" so much - either list all or list none]
-
The reason that sticks in my head is POWER. Size is useful in that it allows you to do things that would be clunky and complicated in a HLL not designed for the purpose, which means that you can trade size for speed when your code is small enough to get away with it. But below all of the practical considerations, it is the architectural freedom to design what you like, in whatever way you like, that is the real reason why I write in x86 assembler.
-
Another reason that I like is the close relation to the circuits (the microprocessor). Electronic stuff is fascinating, I think, so writing in assembly is the best way to control all that circuitry by software. :bgrin:
-
While I agree with you Felipe, unfortunately, everyone is using C to program their microprocessors these days. Very disheartening.
-
While I agree with you Felipe, unfortunately, everyone is using C to program their microprocessors these days. Very disheartening.
I noticed the same, especially since the Internet-of-Things modules such as Arduino and the like appeared. Most of them use the C++ compiler that comes with them. The young people don't even know what an assembler is. Here in Europe there are still many people programming microcontrollers in assembly. There are great forums in Germany.
One thing is for sure: assembly has a very big advantage in speed and size over C/C++ for the very small RAMs of the tiny microcontrollers.
-
For the ESP8266 I've been working with, everything is in C; I haven't seen an assembler for it. So I'm using ZBasic, which converts Basic to C for the ESP.
-
Don't be so negative: assembler is on the rise again: https://www.tiobe.com/tiobe-index/
-
With an ATmega microcontroller programmed in assembly you can send and receive AT commands to and from the ESP8266 module.
This is not hard at all, because the ATmega has a UART on board.
Here is an example in assembly of how to use the UART of an Atmel ATtiny2313 microcontroller:
.include <tn2313def.inc> ; ATtiny2313 definition file
.equ F_CPU = 11059200 ; Hz, CPU clock frequency (used by the wait macro)
; RS232 communication settings
.equ BaudRate = 115200
.equ RxBufferLengte = 64 ; size of the ring buffer in bytes
.org 0x0000 ; code load address 0x0000
rjmp Init ; relative jump to Init
.org URXCaddr
rjmp Receive_Byte ; receive interrupt
.DSEG
RS232_Buffer: .byte RxBufferLengte
RS232_BufferSchrijfPositie: .byte 1
RS232_BufferLeesPositie: .byte 1
RS232_BufferLeesPositieNieuw: .byte 1
RS232_Karakter: .byte 1 ; added: used by Receive_Byte/Send_Byte
RS232_NieuweRegel: .byte 1 ; added: new-line flag set by the ISR
.CSEG
Init:
ldi r16,low(RAMEND) ; initialize the stack
out SPL,r16
cli
rcall RS232_Init ; initialize the RS232 protocol
sei
Start:
; rcall PrintRS232data
; ldi r16,0x4F ; send "O"
; sts RS232_Karakter,r16
; rcall Send_Byte
; ldi r16,0x4B ; send "K"
; sts RS232_Karakter,r16
; rcall Send_Byte
rjmp Start
RS232_Init:
eor r16,r16
sts RS232_BufferSchrijfPositie,r16
sts RS232_BufferLeesPositie,r16
sts RS232_BufferLeesPositieNieuw,r16
sts RS232_NieuweRegel,r16
ldi r16,(0<<U2X) ; no double speed
out UCSRA,r16
; set the baud rate:
ldi r16,High(F_CPU/(16*BaudRate)-1)
out UBRRH,r16
ldi r16,Low(F_CPU/(16*BaudRate)-1)
out UBRRL,r16
; enable receiver (RXEN), transmitter (TXEN) and receive interrupt (RXCIE)
ldi r16,(1<<RXCIE)|(1<<RXEN)|(1<<TXEN)
out UCSRB,r16
; set frame format: 8 data bits (UCSZ1 = 1 & UCSZ0 = 1), 1 stop bit (USBS = 0)
ldi r16,(1<<UCSZ1)|(1<<UCSZ0)|(0<<USBS) ; 8N1
out UCSRC,r16
ret
Receive_Byte:
push r16
in r16,SREG
push r16
push r17 ; added: r17 is clobbered below
push YH ; added: Y is clobbered below
push YL
ldi YH,high(RS232_Buffer)
ldi YL,low(RS232_Buffer)
lds r16,RS232_BufferSchrijfPositie
add YL,r16 ; position within RS232_Buffer (assumes the buffer does not cross a 256-byte page)
mov r17,r16
inc r17
andi r17,RxBufferLengte-1 ; stay inside the ring buffer
sts RS232_BufferSchrijfPositie,r17
in r16,UDR ; received character
sts RS232_Karakter,r16 ; save for echo
st Y,r16 ; store the character in RS232_Buffer
cpi r16,0 ; test for end of command
brne geen_nieuwe_regel
sts RS232_BufferLeesPositieNieuw,r17 ; new position for the next line
ldi r16,1
sts RS232_NieuweRegel,r16 ; signal a new line (we are inside an interrupt)
geen_nieuwe_regel:
pop YL ; added: restore clobbered registers
pop YH
pop r17
pop r16
out SREG,r16
pop r16
reti
Send_Byte:
sbis UCSRA,UDRE ; wait until the transmit buffer is empty
rjmp Send_Byte
lds r16,RS232_Karakter
out UDR,r16
ret
-
Don't be so negative: assembler is on the rise again: https://www.tiobe.com/tiobe-index/
It's just about awareness....
Once the younger generation discover assembler ... :eusa_boohoo: