The MASM Forum

General => The Campus => Topic started by: Nyatta on September 20, 2015, 12:46:58 PM

Title: Working Below 8-Bit Data
Post by: Nyatta on September 20, 2015, 12:46:58 PM
Hello all, this is my first post regarding MASM32 programming or any assembler language, I am still studing the MASM32 Introduction To Assembler before moving on to other documents: please forgive my lack of understanding and knowledge. I may have skipped a mnemonic that does exactly what I need swiftly with the explanation I hope for, I hope to cause no frustration. That said, it is my hope to load and save data that is saved smaller than 1 byte and I have some questions revolving around that so I may begin thinking in this code more fluently; my binary systems are written out, it's all a matter of knowing what I can and can't do efficiently now.

Assuming I were to need to load 2 4-bit values from 1 byte and sum them together is there a system more efficient than splitting them into two 1 byte values and then summing them (if possible)? How much speed am I sacrificing by doing this in any of the possible ways oppose to loading each 4-bit value as a 1 byte value? What if the data were 3-bit, causing data fragments to be split between bytes (assuming I had to load it per-byte)?
Context as to why I personally see this as needed: Working with a 1 gb file vs a 512 mb file of the same data.

Edit:
Because I feel wrong asking this having shown no attempt at accomplishing this, I present my first Assembler code! ( :redface: ):
   .386
   .model flat, stdcall
   option casemap :none

   .code

start:

   jmp @F
   twoVals db 10100011 ;163, but also our 10 and 3 respectively.

  @@:

  push al, twoVals ;163, aka our 10 and 3, now in al.
  push ah, twoVals ;163, aka our 10 and 3, now in ah.
  shl  al, 4       ;163 becomes 48, 3 with 4 extra bits trailing it.
  shr  ah, 4       ;163 becomes 10!
  ror  al, 4       ;48 becomes 3!
  add  al, ah      ;ah is added to al with no flag, resulting in 13!

end start


, Best Regards.
Title: Re: Working Below 8-Bit Data
Post by: dedndave on September 20, 2015, 02:11:41 PM
first - don't try to push a byte
in 32-bit code, the smallest size you should push is 32 bits   :P
you always want the stack address to be 4-byte aligned (0, 4, 8, etc)

there is actually an instruction that will do that
it's named AAM
however, it is intended for working on packed BCD values, not binary
it can be hard-coded (actually, an undocumented capability, that is well-known)
the opcode for AAM is 0D4h, 0Ah
the 0Ah is a decimal 10, used for BCD math (it's a divisor for splitting the byte in AL)
but, if you hard-code
    db 0D4h,10h
it will split the byte in AL into 2 parts
the upper 4 bits of AL will be the lower 4 bits in AH
and the upper 4 bits of AL will be 0's, leaving the original lower 4 bits

the problems are....
first, it's undocumented - kind of a no-no - lol
second, it's very slow - 85 clock cycles or something terrible

it's much faster to use 2 registers
    mov     edx,eax
    shr     edx,4
    mov     ah,dl
    and     ax,0F0Fh


there's probably some cool way to do it even faster
Title: Re: Working Below 8-Bit Data
Post by: dedndave on September 20, 2015, 02:18:22 PM
ok, AAM was 80 or 85 clock cycles on the 8088 CPU
on a pentium, i guess it's around 18 clock cycles
the other method is still faster
Title: Re: Working Below 8-Bit Data
Post by: Nyatta on September 20, 2015, 03:24:48 PM
Oh my, I had typed "push" intending to enter "mov", thinking one thing and typing another.
I had meant to post this:
start:

   jmp @F
   twoVals db 10100011 ;163, but also our 10 and 3 respectively.

  @@:

  mov  al, twoVals ;163, aka our 10 and 3, now in al.
  mov  ah, twoVals ;163, aka our 10 and 3, now in ah.
  shl  al, 4       ;163 becomes 48, 3 with 4 extra bits trailing it.
  shr  ah, 4       ;163 becomes 10!
  ror  al, 4       ;48 becomes 3!
  add  al, ah      ;ah is added to al with no flag, resulting in 13!

end start


84 clock cycles for something so seemingly miniscule that will be repeated many times sounds like an absolute nightmare.

Quote from: dedndave on September 20, 2015, 02:11:41 PMit's much faster to use 2 registers
    mov     edx,eax
    shr     edx,4
    mov     ah,dl
    and     ax,0F0Fh
Does the code that you had mentioned, that uses 2 registers, still fit the updated code where I corrected my mistype?: I'm trying very hard to dissect what that code does line-for-line, perhaps while I'm cutting it out of context?
Title: Re: Working Below 8-Bit Data
Post by: dedndave on September 20, 2015, 03:29:46 PM
well, there are ways to do it using only EAX, of course
let's see....

assuming the value is in AL...
    mov     ah,al
    and     al,0Fh
    shr     ah,4


that's probably a simple way   :P
when you SHR, the new bits at the top are all 0's
Title: Re: Working Below 8-Bit Data
Post by: dedndave on September 20, 2015, 03:34:57 PM
just to explain....

the reason i wanted to do this
    shr     edx,4

is because that's the fastest form of SHR (32-bit register)
the method i just showed you does SHR on an 8-bit register
you'd have to measure, but i would guess the EDX one is slightly faster
the disadvantage is that it uses 2 dword registers
Title: Re: Working Below 8-Bit Data
Post by: dedndave on September 20, 2015, 03:52:12 PM
    mov     ax,1234h   ;AX = 1234h
    mov     ah,al      ;AX = 3434h
    and     al,0Fh     ;AX = 3404h
    shr     ah,4       ;AX = 0304h
Title: Re: Working Below 8-Bit Data
Post by: Nyatta on September 20, 2015, 05:34:04 PM
Say I wanted to efficiently as possible pull 2 points of 5-bit data and add it respectively from 2 8-bit variables would this be my best (visible) option:

386:
  mov ax, 0100010000110101b ;Copies value 8 and first 3 bits of value 16 to ah, final 2 bits in al plus 6 bits to remove. 2 clocks.
  shr ax, 3                 ;ah contains value 8. 3 clocks.
  shr al, 3                 ;al contains value 16. 3 clocks.
  add ah, al                ;Adds al to ah. 2 clock.
                            ;10 clocks total.


486:
  mov ax, 0100010000110101b ;Copies value 8 and first 3 bits of value 16 to ah, final 2 bits in al plus 6 bits to remove. 1 clock.
  ror ax, 3                 ;Rotates ax 3 bits right. 2 clocks.
  ror al, 3                 ;Rotates al 3 bits right. 2 clocks.
  and ax, 0001111100011111b ;Obliterates unused data. 1 clock
  add ah, al                ;Adds al to ah. 1 clock.
                            ;7 clocks total.
Title: Re: Working Below 8-Bit Data
Post by: jj2007 on September 20, 2015, 06:54:02 PM
What about a testbed?

7 1 2 13 10 13 3 12 0 8 9 5 9 10 4 2 9 7 11 15 10 2 13 1 3 12 14 15 15 6
--- encoding 100 Million items: ---
Encoding took 115 ms
Encoding took 112 ms
Encoding took 113 ms
Encoding took 115 ms
Encoding took 113 ms
--- decoding: ---
Decoding took 51 ms
Decoding took 52 ms
Decoding took 52 ms
Decoding took 52 ms
Decoding took 52 ms
7 1 2 13 10 13 3 12 0 8 9 5 9 10 4 2 9 7 11 15 10 2 13 1 3 12 14 15 15 6


Source:

include \masm32\MasmBasic\MasmBasic.inc      ; download (http://masm32.com/board/index.php?topic=94.0)
  Init
  Items=100000000        ; 100 Million nibbles
  TimeCount=5
  ShowValues=30
  Dim Src() As BYTE      ; create array for random data
  Let edi=New$(Items/2)  ; create destination buffer for packed data
  Let ebx=New$(Items)    ; create destination buffer for unpacked data
  xor ecx, ecx
  .Repeat
      void Rand(16)      ; create random values from 0...15, i.e. 4-bit nibbles
      mov Src(ecx), al
      .if ecx<ShowValues
            Print Str$(Src(ecx)), " "      ; sample output
      .endif
      inc ecx
  .Until ecx>=Items
  PrintLine Str$("\n --- encoding %i Million items: ---", Items/1000000)
  REPEAT TimeCount
      call Encode
  ENDM
  PrintLine "--- decoding: ---"
  REPEAT TimeCount
      call Decode
  ENDM
  For_ ecx=0 To ShowValues-1
      Print Str$(byte ptr [ebx+ecx]), " "      ; print sample output from unpacked data
  Next
  Inkey CrLf$, "--- hit any key ---"
  Exit
Encode proc
  NanoTimer()
  push edi
  xor ecx, ecx
  .Repeat
      movzx eax, Src(ecx)      ; load a nibble
      inc ecx                  ; get next random value
      movzx edx, Src(ecx)      ; load another nibble
      shl edx, 4
      add eax, edx
      stosb      ; store packed
      inc ecx
  .Until ecx>=Items
  pop edi
  Print Str$("Encoding took %i ms\n", NanoTimer(ms))
  ret
Encode endp

Decode proc
  NanoTimer()
  xor ecx, ecx
  .Repeat
;       lea edx, [edi+ecx]
;       int 3
      mov al, [edi+ecx]      ; load 2 nibbles
      mov dl, al
      and al, 0FH      ; get first nibble
      mov [ebx+2*ecx], al
      shr dl, 4
      mov [ebx+2*ecx+1], dl
      inc ecx
  .Until ecx>=Items/2
  Print Str$("Decoding took %i ms\n", NanoTimer(ms))
  ret
Decode endp

EndOfCode
Title: Re: Working Below 8-Bit Data
Post by: rrr314159 on September 21, 2015, 01:13:40 AM
Hello NyattaFaux,

Glad to see you want to use assembler for your task ... to go back to your original post for a moment,

Quote from: NyattaFauxit is my hope to load and save data that is saved smaller than 1 byte

- question, do u want to handle data which might be any number of bits: 3, 4, 5 ...? Or can u settle on just one size, which would simplify the job? 4 would be most convenient

- BCD instructions handle 4-bit data. I've never used them, believe they're slower, and they don't apply to 3 or 5-bit data, so let's ignore them.

Quote from: NyattaFauxAssuming I were to need to load 2 4-bit values from 1 byte and sum them together is there a system more efficient than splitting them into two 1 byte values and then summing them (if possible)?

- With possible exception of BCD, No.

Don't do this:

.code
start:
   jmp @F
   twoVals db 10100011 ;163, but also our 10 and 3 respectively.
  @@:


instead, this:

.data
   twoVals db 10100011 ;163, but also our 10 and 3 respectively.
.code

start:


Now, 4 bits is most efficient (since 2 fit exactly in a byte) and allows these techniques:

; decode :
    movzx ax, twoVals
    shr ax, 4
    shl al, 4
; now the 2 vals are in al and ah, can be added etc

; or, decode as dedndave showed:
    movzx al, twoVals
    mov ah, al
    and al, 0Fh
    shr ah, 4

; these decodes are about the same speed

; encode (assume the two nibbles are in al and ah) :
    shl ah, 4
    or al, ah
    mov twoVals, al


However with "n" bits (3, 4, 5) it's a bit clumsier. The decode you gave for "386",


  mov ax, 0100010000110101b ;Copies value 8 and first 3 bits of value 16 to ah, final 2 bits in al plus 6 bits to remove. 2 clocks.
  shr ax, 3                 ;ah contains value 8. 3 clocks.
  shr al, 3                 ;al contains value 16. 3 clocks.
  add ah, al                ;Adds al to ah. 2 clock.
                            ;10 clocks total.


... is fine, and has the advantage that you can simply replace the "3" with "5" or even "n" (where n=3, or 5, etc), and it still works.

Finally, I would recommend putting the encode and decode in a macro, make the code a lot cleaner. For instance,

decode MACRO thetwoVals
    movzx ax, thetwoVals
    shr ax, 4
    shl al, 4
ENDM

; now you can use it like this:

decode twoVals
add al, ah
; ... etc




Title: Re: Working Below 8-Bit Data
Post by: Nyatta on September 21, 2015, 01:02:59 PM
@jj2007:
That does appear extremely helpful, just a bit above the understanding I've gathered thus far; I will definitely return to it when I've begun practicing including .inc files and utilizing them.

@rrr314159:
Alright, thank you for the tips, I will do my best to utilize them.

Would it be more efficient to pair data within the coding (running two mnemonics at once on separate registers)? I have no practice, but from my understanding this works by counting the clock rates between your last action utilizing that register? Say, you need a register while the register is in use, does it halt the coding on that line and wait for it to open? E.g.:
   mov al, value ;Add one clock to register.
   mov ah, value ;Add one clock to register simultaneous with last line.
   ror ax, 8     ;Waits 1 clock to perform due to the registers being in use.


If that is correct, I wrote this to utilize it:
(Note, in the example I provide my data is read right to left per 4 or 8 bits to maximize to observed optimization potential.)
   .data
     eightVals db 34112141h ;1, 4, 1, 2, etc...

   .code

start:

decode MACRO eightVals
   mov eax, eightVals       ;1 clock. eightVal's value is now stored in eax.
   mov edx, eightVals       ;0 additional clocks: paired. eightVal's value is now stored in edx.

   ror edx, 4               ;2 clocks. 4-bit value in edx is now aligned to the right.

   and eax, 0F0F0F0Fh       ;1 clock. Converstion of bytes successful in eax. Contains 1, 1, 1, 4.
   and edx, 0F0F0F0Fh       ;0 additional clocks: paired. Converstion of bytes successful in eax. Contains 4, 2, 1, 3.
ENDM

   decode eightVals
   add al, dl               ;1 clock. Begin respectively ordered addition. 4+1=5
   add al, ah               ;1 clock. 1+5=6
   add al, dh               ;1 clock. 2+6=8

   mov bl, al               ;1 clock. Transfer my current total out of harms way. bl=8

   ror eax, 16               ;2 clocks. Retrieving next 2 values in eax. eax reads 1, 4, 8, 1.
   ror edx, 16               ;0 additional clocks: paired. Retrieving next 2 values in edx. edx reads 1, 3, 4, 2.

   add al, bl               ;1 clock. 8+1=9
   add al, dl               ;1 clock. 1+9=A
   add al, ah               ;1 clock. 4+A=E
   add al, dh               ;1 clock. 3+E=11, stored neatly inside of the byte of al.

end start


If this is a correct use of registers and pairing, the macros don't cost me clock cycles to run, and the code functions as intended, I should have just achieved the sequential summing of 8 4-bit values in 14 clock cycles. 1.75 clocks per 4-bits.
Macros are inserted upon their call location while compiling, correct? Two calls making for 2 copies of it in the compiled program?
Title: Re: Working Below 8-Bit Data
Post by: jj2007 on September 21, 2015, 05:06:37 PM
Quote from: NyattaFaux on September 21, 2015, 01:02:59 PM
   and eax, 0F0F0F0Fh       ;1 clock. Converstion of bytes successful in eax. Contains 1, 1, 1, 4.
   and edx, 0F0F0F0Fh       ;0 additional clocks: paired

My timings (attached) tell a different story...
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

55      cycles for 100 * eax+edx
17      cycles for 100 * eax only
0       cycles for 100 * empty loop

55      cycles for 100 * eax+edx
17      cycles for 100 * eax only
0       cycles for 100 * empty loop

55      cycles for 100 * eax+edx
17      cycles for 100 * eax only
0       cycles for 100 * empty loop

55      cycles for 100 * eax+edx
18      cycles for 100 * eax only
0       cycles for 100 * empty loop

55      cycles for 100 * eax+edx
17      cycles for 100 * eax only
0       cycles for 100 * empty loop

21      bytes for eax+edx
10      bytes for eax only
0       bytes for empty loop


QuoteMacros are inserted upon their call location while compiling, correct? Two calls making for 2 copies of it in the compiled program?

Yes, exactly.
Title: Re: Working Below 8-Bit Data
Post by: Nyatta on September 21, 2015, 06:09:29 PM
@jj2007:
I'm not sure as to what I am looking at... "cycles for 100"? How many clocks over is it and was it paired?
From the "Introduction to Assembler" document: "pairing, when mnemonics can go through the two pipelines in pairs, the code runs nominally twice as fast." My interpretation was that by performing mnemonics on multiple fitting registers I am cutting the total clocks down. Could it have been that I was using EDX for an "and" mnemonic?

@rrr314159:
My apologies, I hadn't responded to your question to which the answer is yes: the dynamacy would be nice. If I needed to make it handle an array of bit sizes I'd write each bit-size's system separately in order to optimise it to it's fullest... Unless of course that is actually putting a higher strain on the processor.

Mini-rant:
I will sit and work on something small for the longest period of time just to ensure I am accomplishing it with the least number of clocks possible (as it is very important to me and to lengthy looping systems). What I am doing now, practicing optimized machine code to push my dreams into reality, is really amazing so far because ever since I got the idea in my head to do this I've had not the slightest understanding as to what I was reading ... in part because I was too eager, too excited, not aware enough of how these systems work, and not patient enough to actually find the words I needed to read and read them ... REGARDLESS, years later I am really motivated and feel ready to be learning this now, I'm going to be putting time into forming an understanding of optimal systems to work by, and I'll do my best to keep any silly questions to a minimum and utilize all of the current learning resources. Point being: I'm trying very hard not to be obnoxious and plan to be around. :redface:
I have yet to finish the "Introduction to Assembler" document let alone any other resource, I am working on getting that down at a careful speed.
Title: Re: Working Below 8-Bit Data
Post by: jj2007 on September 21, 2015, 06:18:07 PM
Quote from: NyattaFaux on September 21, 2015, 06:09:29 PM
@jj2007:
I'm not sure as to what I am looking at... "cycles for 100"? How many clocks over is it and was it paired?

See lines 46ff and 67ff attached above for "was it paired".
Re clocks, yes these numbers are for one-hundred loop iterations. So that makes it 0.16 cycles per iteration, e.g. for the eax only version:

  mov ebx, AlgoLoops-1      ; loop e.g. 100x
  align 4
  .Repeat
       mov eax, eightVals       ;1 clock. eightVal's value is now stored in eax.
;        mov edx, eightVals       ;0 additional clocks: paired. eightVal's value is now stored in edx.
     
;        ror edx, 4               ;2 clocks. 4-bit value in edx is now aligned to the right.
     
       and eax, 0F0F0F0Fh       ;1 clock. Converstion of bytes successful in eax. Contains 1, 1, 1, 4.
;        and edx, 0F0F0F0Fh       ;0 additional clocks: paired. Converstion of bytes successful in eax. Contains 4, 2, 1, 3.
      dec ebx
  .Until Sign?


Where did you get these "1 clock" etc?
Title: Re: Working Below 8-Bit Data
Post by: Nyatta on September 22, 2015, 04:00:45 AM
Quote from: jj2007 on September 21, 2015, 06:18:07 PM
Where did you get these "1 clock" etc?

I had found them in MASM32 under Help>>Opcodes Help. It brings up the Intel Opcodes and Mnemonics window. For "AND" I read that "register, immediate" on a .486 is 1 clock cycle. I'm not sure I understand how I could have created something that clocks less than 1 cycle: it was my understanding 1 clock is the minimum?
Title: Re: Working Below 8-Bit Data
Post by: jj2007 on September 22, 2015, 04:49:34 AM
Quote from: NyattaFaux on September 22, 2015, 04:00:45 AMI had found them in MASM32 under Help>>Opcodes Help.

Those are a little bit outdated - the original opcodes.hlp file was last edited in April 1998 8)

Quoteit was my understanding 1 clock is the minimum?

Yep, that was also my understanding until I was forced, by valuable membersTM of this Forum, to accept that the story is a little bit more complex. If you want to dig deeper, read here (http://electronics.stackexchange.com/questions/123760/how-can-a-cpu-deliver-more-than-one-instruction-per-cycle) Paul A. Clayton's answer, the one that starts with
QuoteThe Ivy Bridge microarchitecture of the i7 3630QM can only commit 4 fused µops per cycle, though it can begin execution of 6 µops per cycle.
Title: Re: Working Below 8-Bit Data
Post by: rrr314159 on September 22, 2015, 08:19:07 AM
Hi NyattaFaux,

code optimization is complicated. The biggest name in the subject is Agner Fog, google him with "assembler optimization". You can get multiple instructions executing in parallel but it's not as simple as just using different registers. Also if 100 instructions take 17 cycles then obviously one averaged .17, but that's not to say if you execute one it will take .17; instead it would probably take about 1 whole cycle; because the 100 have been pipelined, partially parallelized.

It's very sensible to read such documents (and Paul Clayton) but you can't optimize purely theoretically; you have to code and time alternatives. It also varies per computer, OS, environment, and (it seems) phase of the moon. Can be very frustrating: it happens often that I put in extra instructions, and the code goes faster! Probably because it moved an inner loop to a better place, usually aligned by 4, but sometimes it works better non-aligned. It's not an exact science, or put it this way: it's not a purely theoretical science, but experimental.

Consider getting jj2007's MasmBasic timing example running, particularly if you know Basic already. To use macro's effectively you must read up on those too - actually there are many such topics. I like "Art of Assembly" by Randall Hyde (1st edition). Any sections that bore you, or seem irrelevant, skip. Check out MSDN and Intel ref documents. Review posts on this forum; look at the Laboratory in particular. Almost all of my threads there deal, partly, with optimization and timing; and many others (e.g. recent ones from zedd151).

Get examples working; that's much better than just reading. For the time being concentrate on general topics, keeping in mind optimization and sub-byte number encoding. The three most important things to do are: write code, write code, and write code! It really doesn't matter, at this point, what the code does, as long as it works.

The opcode clock cycles from masm32 help are somewhat obsolete and oversimplified but still very useful as a first cut. Often you don't need any more modern / detailed info. It's great that you picked up on those yourself, good attitude.

Quote from: NyattaFauxIf I needed to make it handle an array of bit sizes I'd write each bit-size's system separately in order to optimise it to it's fullest... Unless of course that is actually putting a higher strain on the processor.

- yes that's the fastest way, separate code for each task. Macros can help - you write a set of routines just once, with appropriate "variable symbolic constants" (such as n which can be 3, 4, 5 ...) then at compile time it generates 3 separate sets of routines from the one set of macros. For instance my "MathArt" in the Workshop has (pretty sophisticated) examples of that technique. I could give the reference but encourage you to browse all the posts there, many threads are of interest.

Quote from: NyattaFauxI will sit and work on something small for the longest period of time just to ensure I am accomplishing it with the least number of clocks possible

- great attitude for an assembler coder, but apply it not only to clocks (which we usually call cycles) but to all topics (looping, arithmetic instructions, procedures, macros, ...). The quickest path to mastering optimization (which, BTW, no one in the world has) involves understanding assembler first / concurrently, you just can't do it without that knowledge base
Title: Re: Working Below 8-Bit Data
Post by: Nyatta on September 22, 2015, 10:07:14 AM
Quote from: jj2007 on September 22, 2015, 04:49:34 AM
Quote from: NyattaFaux on September 22, 2015, 04:00:45 AMI had found them in MASM32 under Help>>Opcodes Help.

Those are a little bit outdated - the original opcodes.hlp file was last edited in April 1998 8)

Quoteit was my understanding 1 clock is the minimum?
I can't begin to explain how much my mind is being blown right now- Like I am tearing up laughing at how immensely powerful that processing power is because I can't accept it's real. :icon_eek:

Quote from: rrr314159 on September 22, 2015, 08:19:07 AM
if you execute one it will take .17; instead it would probably take about 1 whole cycle; because the 100 have been pipelined, partially parallelized.
I am under a false understanding of what a register is then ... What range of sets of registers are stowed away in a single core on average and am I running multi-core code by default?

Quote from: rrr314159 on September 22, 2015, 08:19:07 AM
it's not a purely theoretical science, but experimental.
That sounds like the entire basis as to my coding goals so far.

I do plan to utilize jj2007's MasmBasic to make the process of learning easier, it's just not top priority between getting familiar with the basic documents first: I feel I will be over-complicating my understanding by not finishing first thing first.

Quote from: rrr314159 on September 22, 2015, 08:19:07 AM
Get examples working; that's much better than just reading. For the time being concentrate on general topics, keeping in mind optimization and sub-byte number encoding. The three most important things to do are: write code, write code, and write code! It really doesn't matter, at this point, what the code does, as long as it works.
Exactly my plans now, I couldn't agree more.

Quote from: rrr314159 on September 22, 2015, 08:19:07 AM
Quote from: NyattaFauxI will sit and work on something small for the longest period of time just to ensure I am accomplishing it with the least number of clocks possible

- great attitude for an assembler coder, but apply it not only to clocks (which we usually call cycles) but to all topics (looping, arithmetic instructions, procedures, macros, ...). The quickest path to mastering optimization (which, BTW, no one in the world has) involves understanding assembler first / concurrently, you just can't do it without that knowledge base
Thank you for the compliment, and I am in complete agreement with the rest.
I had turned to asm seeing as everyone I'm close to knows how much I tend to complain about how inefficient modern mainstream coding tends to be. Now, seeing that full optimization is seeming so far out of reach, yet this language performs many times faster than I could have hoped, I certainly I filled with motivation for finding potential rocks left unturned. Ensuring a strong foundation is an important step to moving on, and currently I don't see it living up to it's full potential ... not to say it's easy, I'm just hopeful. :redface:
Title: Re: Working Below 8-Bit Data
Post by: jj2007 on September 22, 2015, 10:49:29 AM
Quote from: NyattaFaux on September 22, 2015, 10:07:14 AMthis language performs many times faster than I could have hoped

Yeah, there has been a little progress since I learnt coding on my 8 MHz Motorola 68000 in the late 1980ies  8)

And we are pushing the limits here. If we are bored, we pick a routine from the fastest library the World has ever seen (i.e. the C runtime library) and try to do it twice as fast (http://masm32.com/board/index.php?topic=4631.0)  :biggrin:
Title: Re: Working Below 8-Bit Data
Post by: rrr314159 on September 22, 2015, 12:35:56 PM
Quote...laughing at how immensely powerful that processing power is because I can't accept it's real

- gives intuitive feel for nano / micro / milli seconds, something new for human race

QuoteI am under a false understanding of what a register is then ... What range of sets of registers are stowed away in a single core on average and am I running multi-core code by default?

- all registers, instructions, cache, pipeline etc, are on each core, with some exceptions for different types of cache. Default you have a single core, to use another you create a thread

Title: Re: Working Below 8-Bit Data
Post by: hutch-- on September 22, 2015, 03:44:12 PM
There are a few things here, clock cycles belong to an i486 or earlier (1990), as processor started using multiple instruction pipelines the idea of each instruction munching away at a given cycle count went out the door. Over time the action has been in scheduling instructions through multiple pipelines and that involves some understanding of why some instructions are faster than others. While they are no joy to start on, get the Intel manuals and especially the optimisation manual and spend some time reading it as its the best there is.

The guts of later x86/64 hardware is not something you want to try and track but the historical interface that we have as assembler instructions (mnemonics) fall into a couple of different classes, the really fast ones are hard coded in silicon where the much slower ones are written in what is called "micro code" and while this changes from one processor to another, the basic logic remains much the same.

This gives you a defacto preferred set of instructions, generally the simpler ones and in most instances constructing high speed code means preferentially using these instructions and avoiding the older style more complex instructions. Pick the right instructions and you get results like better pairing through multiple pipelines and in some cases you you get higher throughput by carefully matching instructions so that they do pair well. What happens with some of the older instructions is that they will stall both pipelines as the result is required before the other pipeline can continue.

Always the magic rule is to time algorithms to see what works the best and try and match your timing technique to the style of task that an algorithm performs.
Title: Re: Working Below 8-Bit Data
Post by: Nyatta on September 23, 2015, 12:49:59 AM
@jj2007 and rrr314159:
That is truly amazing! Again, I knew that speaking directly to the processor opened doors to higher potential, but it seems that my understanding of said potential fell far short as to what it actually was ... likely as an act of self-preservation from potential false information; it gives me a big goofy smile thinking about the potential that asm development unlocks. :lol:

@hutch--:
Quote from: hutch-- on September 22, 2015, 03:44:12 PM
While they are no joy to start on, get the Intel manuals and especially the optimisation manual and spend some time reading it as its the best there is.
I had already downloaded "all seven volumes of the Intel 64 and IA-32 Architectures Software Developer's Manual: Basic Architecture, Instruction Set Reference A-M, Instruction Set Reference NZ, Instruction Set Reference, and the System Programming Guide, Parts 1, 2 and 3", which I plan to refer to in the future after gaining a more developed understanding as to where I hope to go ... seeing as it's 3603 pages of high density information, best approached with prior understanding for the likes of me.

Quote from: hutch-- on September 22, 2015, 03:44:12 PM
the really fast ones are hard coded in silicon where the much slower ones are written in what is called "micro code"

[...]

Pick the right instructions and you get results like better pairing through multiple pipelines and in some cases you you get higher throughput by carefully matching instructions so that they do pair well. What happens with some of the older instructions is that they will stall both pipelines as the result is required before the other pipeline can continue.
It's my goal to stick entirely to hard coded instructions where possible for me to comprehend a solution for the benefit of speed, and with time hopefully I'll find some tricks to share within a few years. In the example I had given for the pairing I had stuck to that spirit and recognized that my coding was heavily dependant on awaiting the results of prior data before moving on ... I may or may not be in denial and plotting a way to process data faster ... which I already have. ;)

Quote from: hutch-- on September 22, 2015, 03:44:12 PM
Always the magic rule is to time algorithms to see what works the best and try and match your timing technique to the style of task that an algorithm performs.
I don't quite understand, "match [my] timing technique"? Is this in reference to choosing an order of fitting instructions that avoid halting potential processing power, making for more optimal code?

----

Many thanks to everyone for taking time to explain these things to me, it means a lot especially seeing as most of this information is likely repeated seemingly endlessly.
Title: Re: Working Below 8-Bit Data
Post by: dedndave on September 23, 2015, 01:06:31 AM
guidelines used by many members...

http://www.agner.org/optimize/ (http://www.agner.org/optimize/)

http://www.mark.masmcode.com/ (http://www.mark.masmcode.com/)

some info may be a bit dated
at the end of the day, measuring it is your best bet   :P