Working Below 8-Bit Data

Started by Nyatta, September 20, 2015, 12:46:58 PM


jj2007

Quote from: NyattaFaux on September 22, 2015, 04:00:45 AM
I had found them in MASM32 under Help>>Opcodes Help.

Those are a little bit outdated - the original opcodes.hlp file was last edited in April 1998 8)

Quote
it was my understanding 1 clock is the minimum?

Yep, that was also my understanding until I was forced, by valuable members(TM) of this Forum, to accept that the story is a little bit more complex. If you want to dig deeper, read Paul A. Clayton's answer here, the one that starts with
Quote
The Ivy Bridge microarchitecture of the i7 3630QM can only commit 4 fused µops per cycle, though it can begin execution of 6 µops per cycle.

rrr314159

Hi NyattaFaux,

code optimization is complicated. The biggest name in the subject is Agner Fog; google him with "assembler optimization". You can get multiple instructions executing in parallel, but it's not as simple as just using different registers. Also, if 100 instructions take 17 cycles then obviously each one averaged 0.17, but that's not to say that a single instruction executed by itself takes 0.17; it would probably take about 1 whole cycle, because the 100 have been pipelined and partially parallelized.
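To give a rough idea of what "not as simple as just using different registers" means, here is a tiny sketch (register choice and the values are invented purely for illustration): in the first fragment every add must wait for the previous result; in the second the CPU can work on two chains at once.

    ; dependent chain - each add needs the result of the one before it
    mov eax, 1
    add eax, 2
    add eax, 3
    add eax, 4

    ; two independent chains - the scheduler can run them in parallel
    mov eax, 1
    mov ecx, 2
    add eax, 3
    add ecx, 4
    add eax, ecx        ; only this final add waits on both chains

Whether the second version actually wins depends on everything around it - so, as always, time it.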

It's very sensible to read such documents (and Paul Clayton), but you can't optimize purely theoretically; you have to code and time alternatives. It also varies per computer, OS, environment, and (it seems) phase of the moon. It can be very frustrating: it happens often that I put in extra instructions and the code goes faster! Probably because it moved an inner loop to a better place, usually aligned by 4, but sometimes it works better non-aligned. It's not an exact science, or put it this way: it's not a purely theoretical science, but an experimental one.
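If you want a quick-and-dirty way of timing alternatives before you have a proper framework running, the time-stamp counter is the usual starting point. Just a sketch - serious measurements need serialization, many repetitions and a quiet machine:

    rdtsc               ; start: time-stamp counter into edx:eax
    push edx
    push eax

    ; ... the code under test, repeated many times ...

    rdtsc               ; stop
    pop ecx             ; start value, low dword
    pop ebx             ; start value, high dword
    sub eax, ecx        ; elapsed ticks, low dword
    sbb edx, ebx        ; elapsed ticks, high dword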

Consider getting jj2007's MasmBasic timing example running, particularly if you know Basic already. To use macros effectively you must read up on those too - actually there are many such topics. I like "Art of Assembly" by Randall Hyde (1st edition); skip any sections that bore you or seem irrelevant. Check out MSDN and the Intel reference documents. Review posts on this forum; look at the Laboratory in particular. Almost all of my threads there deal, partly, with optimization and timing, and so do many others (e.g. recent ones from zedd151).

Get examples working; that's much better than just reading. For the time being concentrate on general topics, keeping in mind optimization and sub-byte number encoding. The three most important things to do are: write code, write code, and write code! It really doesn't matter, at this point, what the code does, as long as it works.

The opcode clock cycles from masm32 help are somewhat obsolete and oversimplified but still very useful as a first cut. Often you don't need any more modern / detailed info. It's great that you picked up on those yourself, good attitude.

Quote from: NyattaFaux
If I needed to make it handle an array of bit sizes I'd write each bit-size's system separately in order to optimise it to its fullest... Unless of course that is actually putting a higher strain on the processor.

- yes, that's the fastest way: separate code for each task. Macros can help - you write a set of routines just once, with appropriate "variable symbolic constants" (such as n, which can be 3, 4, 5 ...), and at assembly time that one set of macros generates a separate set of routines for each value. For instance my "MathArt" in the Workshop has (pretty sophisticated) examples of that technique. I could give the reference but encourage you to browse all the posts there, many threads are of interest.
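As a rough sketch of that idea (the name ExtractBits, the register convention and the field-extraction job are all invented here just for illustration), a macro like this expands once per bit width, so you get a dedicated routine for each size at assembly time:

    ExtractBits MACRO bits              ;; expands to ExtractBits3, ExtractBits4, ...
    ExtractBits&bits& PROC
        ; eax = packed dword, ecx = zero-based field index
        ; returns the selected bits-wide field, zero-extended, in eax
        imul ecx, ecx, bits             ; bit offset = index * bits
        shr eax, cl                     ; shift the field down to bit 0
        and eax, (1 SHL bits) - 1       ; mask off the higher fields
        ret
    ExtractBits&bits& ENDP
    ENDM

    ExtractBits 3                       ; one line per width you need
    ExtractBits 4
    ExtractBits 5

(register arguments, no stack frame - it's only meant to show one macro producing several generated routines)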

Quote from: NyattaFaux
I will sit and work on something small for the longest period of time just to ensure I am accomplishing it with the least number of clocks possible

- great attitude for an assembler coder, but apply it not only to clocks (which we usually call cycles) but to all topics (looping, arithmetic instructions, procedures, macros, ...). The quickest path to mastering optimization (which, BTW, no one in the world has) involves understanding assembler first, or concurrently; you just can't do it without that knowledge base.
I am NaN ;)

Nyatta

Quote from: jj2007 on September 22, 2015, 04:49:34 AM
Quote from: NyattaFaux on September 22, 2015, 04:00:45 AM
I had found them in MASM32 under Help>>Opcodes Help.

Those are a little bit outdated - the original opcodes.hlp file was last edited in April 1998 8)

Quote
it was my understanding 1 clock is the minimum?
I can't begin to explain how much my mind is being blown right now. I am tearing up laughing at how immensely powerful that processing power is, because I can't accept it's real. :icon_eek:

Quote from: rrr314159 on September 22, 2015, 08:19:07 AM
if you execute one it will take .17; instead it would probably take about 1 whole cycle; because the 100 have been pipelined, partially parallelized.
I am under a false understanding of what a register is, then ... How many sets of registers are stowed away in a single core, on average, and am I running multi-core code by default?

Quote from: rrr314159 on September 22, 2015, 08:19:07 AM
it's not a purely theoretical science, but experimental.
That sounds like the entire basis as to my coding goals so far.

I do plan to utilize jj2007's MasmBasic to make the process of learning easier; it's just not top priority until I've become familiar with the basic documents first. I feel I would be over-complicating my understanding by not finishing first things first.

Quote from: rrr314159 on September 22, 2015, 08:19:07 AM
Get examples working; that's much better than just reading. For the time being concentrate on general topics, keeping in mind optimization and sub-byte number encoding. The three most important things to do are: write code, write code, and write code! It really doesn't matter, at this point, what the code does, as long as it works.
Exactly my plans now, I couldn't agree more.

Quote from: rrr314159 on September 22, 2015, 08:19:07 AM
Quote from: NyattaFauxI will sit and work on something small for the longest period of time just to ensure I am accomplishing it with the least number of clocks possible

- great attitude for an assembler coder, but apply it not only to clocks (which we usually call cycles) but to all topics (looping, arithmetic instructions, procedures, macros, ...). The quickest path to mastering optimization (which, BTW, no one in the world has) involves understanding assembler first / concurrently, you just can't do it without that knowledge base
Thank you for the compliment, and I am in complete agreement with the rest.
I had turned to asm seeing as everyone I'm close to knows how much I tend to complain about how inefficient modern mainstream coding tends to be. Now, seeing that full optimization seems so far out of reach, yet this language performs many times faster than I could have hoped, I am certainly filled with motivation for finding potential rocks left unturned. Ensuring a strong foundation is an important step to moving on, and currently I don't see it living up to its full potential ... not to say it's easy, I'm just hopeful. :redface:

jj2007

Quote from: NyattaFaux on September 22, 2015, 10:07:14 AM
this language performs many times faster than I could have hoped

Yeah, there has been a little progress since I learnt coding on my 8 MHz Motorola 68000 in the late 1980s  8)

And we are pushing the limits here. If we are bored, we pick a routine from the fastest library the World has ever seen (i.e. the C runtime library) and try to do it twice as fast  :biggrin:

rrr314159

Quote
...laughing at how immensely powerful that processing power is because I can't accept it's real

- gives an intuitive feel for nano / micro / milliseconds, something new for the human race

Quote
I am under a false understanding of what a register is, then ... How many sets of registers are stowed away in a single core, on average, and am I running multi-core code by default?

- all registers, instructions, cache, pipeline etc. are on each core, with some exceptions for different types of cache. By default you run on a single core; to use another you create a thread.
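In MASM32 terms "create a thread" is just one API call; a bare-bones sketch (ThreadProc, hThread and dwThreadId are names made up for the example, error checking left out):

    .data?
    hThread     dd ?
    dwThreadId  dd ?

    .code
    ThreadProc PROC lpParam:DWORD
        ; the work you want running on another core goes here
        xor eax, eax                    ; thread exit code
        ret
    ThreadProc ENDP

        ; somewhere in your main code:
        invoke CreateThread, NULL, 0, ADDR ThreadProc, NULL, 0, ADDR dwThreadId
        mov hThread, eax                ; handle to the new thread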

I am NaN ;)

hutch--

There are a few things here. Fixed clock cycle counts belong to an i486 or earlier (1990); as processors started using multiple instruction pipelines, the idea of each instruction munching away at a given cycle count went out the door. Over time the action has been in scheduling instructions through multiple pipelines, and that involves some understanding of why some instructions are faster than others. While they are no joy to start on, get the Intel manuals, especially the optimisation manual, and spend some time reading them, as they're the best there is.

The guts of later x86/64 hardware is not something you want to try to track, but the historical interface that we have as assembler instructions (mnemonics) falls into a couple of different classes: the really fast ones are hard coded in silicon, while the much slower ones are written in what is called "microcode". While this changes from one processor to another, the basic logic remains much the same.

This gives you a de facto preferred set of instructions, generally the simpler ones, and in most instances constructing high speed code means preferentially using these instructions and avoiding the older-style, more complex instructions. Pick the right instructions and you get results like better pairing through multiple pipelines, and in some cases you get higher throughput by carefully matching instructions so that they do pair well. What happens with some of the older instructions is that they will stall both pipelines, as the result is required before the other pipeline can continue.
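A small example of what that preferred set looks like in practice (the gain varies from one processor to another, so time it on yours): compact legacy instructions like loop tend to be the slow ones, while the simple equivalents are the fast silicon-coded kind.

    ; older, compact style
    lp1:
        ; ... loop body ...
        loop lp1                ; short to write, but slow on modern cores

    ; preferred style - simple instructions the scheduler likes
    lp2:
        ; ... loop body ...
        dec ecx
        jnz lp2

    ; lea does a multiply-and-add in one fast instruction
        lea eax, [eax+eax*4]    ; eax = eax * 5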

Always the magic rule is to time algorithms to see what works best, and to try to match your timing technique to the style of task that an algorithm performs.

Nyatta

@jj2007 and rrr314159:
That is truly amazing! Again, I knew that speaking directly to the processor opened doors to higher potential, but it seems that my understanding of said potential fell far short of what it actually is ... likely as an act of self-preservation from potential false information; it gives me a big goofy smile thinking about the potential that asm development unlocks. :lol:

@hutch--:
Quote from: hutch-- on September 22, 2015, 03:44:12 PM
While they are no joy to start on, get the Intel manuals, especially the optimisation manual, and spend some time reading them, as they're the best there is.
I had already downloaded "all seven volumes of the Intel 64 and IA-32 Architectures Software Developer's Manual: Basic Architecture, Instruction Set Reference A-M, Instruction Set Reference N-Z, Instruction Set Reference, and the System Programming Guide, Parts 1, 2 and 3", which I plan to refer to in the future after gaining a more developed understanding of where I hope to go ... seeing as it's 3603 pages of high-density information, best approached with prior understanding for the likes of me.

Quote from: hutch-- on September 22, 2015, 03:44:12 PM
the really fast ones are hard coded in silicon, while the much slower ones are written in what is called "microcode"

[...]

Pick the right instructions and you get results like better pairing through multiple pipelines, and in some cases you get higher throughput by carefully matching instructions so that they do pair well. What happens with some of the older instructions is that they will stall both pipelines, as the result is required before the other pipeline can continue.
It's my goal to stick entirely to hard-coded instructions wherever I can comprehend a solution, for the benefit of speed, and with time hopefully I'll find some tricks to share within a few years. In the example I had given for the pairing I had stuck to that spirit and recognized that my coding was heavily dependent on awaiting the results of prior data before moving on ... I may or may not be in denial and plotting a way to process data faster ... which I already have. ;)

Quote from: hutch-- on September 22, 2015, 03:44:12 PM
Always the magic rule is to time algorithms to see what works the best and try and match your timing technique to the style of task that an algorithm performs.
I don't quite understand "match [my] timing technique". Is this in reference to choosing an order of fitting instructions that avoids stalling potential processing power, making for more optimal code?

----

Many thanks to everyone for taking the time to explain these things to me; it means a lot, especially seeing as most of this information has likely been repeated seemingly endlessly.

dedndave

guidelines used by many members...

http://www.agner.org/optimize/

http://www.mark.masmcode.com/

some info may be a bit dated
at the end of the day, measuring it is your best bet   :P