Optimisation

Started by sinsi, February 14, 2019, 04:26:42 PM


sinsi

There was a nice basic series of blog posts from Raymond Chen about the 386.
A couple of things about optimisations that I use were interesting.


;I usually use these, because the instructions are shorter
83 66 0C 00           and     dword ptr [esi+0Ch], 0  ;3 bytes shorter but memory read-modify-write
83 CF FF              or      edi, 0FFFFFFFFh         ;2 bytes shorter but creates a (false) dependency

;These are apparently better
C7 46 0C 00 00 00 00  mov     dword ptr [esi+0Ch], 0  ;simple memory write
BF FF FF FF FF        mov     edi, 0FFFFFFFFh         ;no dependency


I also remember, in a discussion about XOR r,r vs SUB r,r, someone said that Intel/AMD have
optimised XOR to be more efficient at zeroing a register. Does anyone have any information?
🍺🍺🍺

jj2007

Agner Fog, Optimizing subroutines in assembly language:
Quote: A common way of setting a register to zero is XOR EAX,EAX or SUB EAX,EAX. Some processors recognize that these instructions are independent of the prior value of the register.

I remember some AMD manual where they wrote explicitly that xor reg32, reg32 is the "officially recommended" way to zero reg32, but I can't find the reference.
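For reference, the usual zero idioms side by side (a sketch in MASM syntax; the comments reflect the idiom recognition Agner Fog describes for recent Intel/AMD cores):

```asm
xor eax, eax    ; 2 bytes, recognised as a dependency-breaking zero idiom
sub eax, eax    ; 2 bytes, also recognised as an idiom on most modern cores
mov eax, 0      ; 5 bytes, not a recognised idiom, but leaves the flags untouched
```

Note that xor and sub both clobber the flags, which is why the longer mov still has a place when the flags must survive.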

Re and dword ptr mem, 0: yes, it is true that under certain conditions this could be slower, say in a tight loop with millions of iterations. In all other conditions the and is 3 bytes shorter; so if the tight loop that follows a few lines later is a tiny bit too long to fit in the instruction cache, the and version is probably much faster than the longer mov.

hutch--

So much for these optimisations: the hardware (processor version) dictates how fast certain mnemonics are. LEA was very fast on a PIII; on a PIV you tried to avoid it. SSE was very ordinary on a PIV; from Core2 upwards it's a lot faster. I remember most of the old Intel optimisation advice: use ADD or SUB, not INC or DEC; zero a register with XOR, although SUB reg, reg has always worked OK. Jump reduction where the jumps would be taken can sometimes deliver speed improvements, but fall-through jumps rarely ever matter.
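The old ADD/SUB-over-INC/DEC advice comes from partial flag updates; a sketch of the difference:

```asm
inc ecx         ; writes every flag except CF; on some older cores a later
                ; read of the flags forces a partial-flags merge (a stall)
add ecx, 1      ; one byte longer, but writes all the flags in one go
```

On current cores the penalty is small or gone, which is exactly the "hardware dictates" point: the right choice moved with the processor generation.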

One area where I disagree is the relevance of instruction byte count. In the 16-bit MS-DOS era you used short jumps; from 32 bit onwards it just does not matter. Much the same comment applies to the pursuit of shorter encodings: x86 processors are instruction munchers, not byte counters. If you have a very long algorithm that risks exceeding the cache, a re-design is a better choice than short encodings; close branching to sub procedures tends to keep more code in the cache than long tangled messes. I remember years ago some guy who inlined a massive number of identical instructions to avoid loop code with branching. It was as slow as a wet week.

The magic rule with encodings is to TIME it in REAL TIME. Forget tricky dicky smart arse cycle counts; duration is the one that matters. Then, to your joy, you run it on a different processor and it could be either faster or slower relative to other techniques. Such are the joys of mixed hardware programming.  :P

Siekmanski

Quote from: hutch-- on February 14, 2019, 10:06:56 PM
Then to your joy you run it on a different processor and it could be either faster relative to other techniques or slower. Such are the joys of mixed hardware programming.  :P

But if you are an over-the-top programmer, there is always the possibility of writing the fastest algorithm for every available processor type.
Big chance you'll end up in a madhouse.  :badgrin:
Creative coders use backward thinking techniques as a strategy.