News:

Masm32 SDK description, downloads and other helpful links
Message to All Guests

Main Menu

Working Below 8-Bit Data

Started by Nyatta, September 20, 2015, 12:46:58 PM

Previous topic - Next topic

Nyatta

Hello all, this is my first post regarding MASM32 programming or any assembler language, I am still studing the MASM32 Introduction To Assembler before moving on to other documents: please forgive my lack of understanding and knowledge. I may have skipped a mnemonic that does exactly what I need swiftly with the explanation I hope for, I hope to cause no frustration. That said, it is my hope to load and save data that is saved smaller than 1 byte and I have some questions revolving around that so I may begin thinking in this code more fluently; my binary systems are written out, it's all a matter of knowing what I can and can't do efficiently now.

Assuming I were to need to load 2 4-bit values from 1 byte and sum them together is there a system more efficient than splitting them into two 1 byte values and then summing them (if possible)? How much speed am I sacrificing by doing this in any of the possible ways oppose to loading each 4-bit value as a 1 byte value? What if the data were 3-bit, causing data fragments to be split between bytes (assuming I had to load it per-byte)?
Context as to why I personally see this as needed: Working with a 1 gb file vs a 512 mb file of the same data.

Edit:
Because I feel wrong asking this having shown no attempt at accomplishing this, I present my first Assembler code! ( :redface: ):
   .386
   .model flat, stdcall
   option casemap :none

   .code

start:

   jmp @F
   twoVals db 10100011 ;163, but also our 10 and 3 respectively.

  @@:

  push al, twoVals ;163, aka our 10 and 3, now in al.
  push ah, twoVals ;163, aka our 10 and 3, now in ah.
  shl  al, 4       ;163 becomes 48, 3 with 4 extra bits trailing it.
  shr  ah, 4       ;163 becomes 10!
  ror  al, 4       ;48 becomes 3!
  add  al, ah      ;ah is added to al with no flag, resulting in 13!

end start


, Best Regards.

dedndave

first - don't try to push a byte
in 32-bit code, the smallest size you should push is 32 bits   :P
you always want the stack address to be 4-byte aligned (0, 4, 8, etc)

there is actually an instruction that will do that
it's named AAM
however, it is intended for working on packed BCD values, not binary
it can be hard-coded (actually, an undocumented capability, that is well-known)
the opcode for AAM is 0D4h, 0Ah
the 0Ah is a decimal 10, used for BCD math (it's a divisor for splitting the byte in AL)
but, if you hard-code
    db 0D4h,10h
it will split the byte in AL into 2 parts
the upper 4 bits of AL will be the lower 4 bits in AH
and the upper 4 bits of AL will be 0's, leaving the original lower 4 bits

the problems are....
first, it's undocumented - kind of a no-no - lol
second, it's very slow - 85 clock cycles or something terrible

it's much faster to use 2 registers
    mov     edx,eax
    shr     edx,4
    mov     ah,dl
    and     ax,0F0Fh


there's probably some cool way to do it even faster

dedndave

ok, AAM was 80 or 85 clock cycles on the 8088 CPU
on a pentium, i guess it's around 18 clock cycles
the other method is still faster

Nyatta

Oh my, I had typed "push" intending to enter "mov", thinking one thing and typing another.
I had meant to post this:
start:

   jmp @F
   twoVals db 10100011 ;163, but also our 10 and 3 respectively.

  @@:

  mov  al, twoVals ;163, aka our 10 and 3, now in al.
  mov  ah, twoVals ;163, aka our 10 and 3, now in ah.
  shl  al, 4       ;163 becomes 48, 3 with 4 extra bits trailing it.
  shr  ah, 4       ;163 becomes 10!
  ror  al, 4       ;48 becomes 3!
  add  al, ah      ;ah is added to al with no flag, resulting in 13!

end start


84 clock cycles for something so seemingly miniscule that will be repeated many times sounds like an absolute nightmare.

Quote from: dedndave on September 20, 2015, 02:11:41 PMit's much faster to use 2 registers
    mov     edx,eax
    shr     edx,4
    mov     ah,dl
    and     ax,0F0Fh
Does the code that you had mentioned, that uses 2 registers, still fit the updated code where I corrected my mistype?: I'm trying very hard to dissect what that code does line-for-line, perhaps while I'm cutting it out of context?

dedndave

well, there are ways to do it using only EAX, of course
let's see....

assuming the value is in AL...
    mov     ah,al
    and     al,0Fh
    shr     ah,4


that's probably a simple way   :P
when you SHR, the new bits at the top are all 0's

dedndave

just to explain....

the reason i wanted to do this
    shr     edx,4

is because that's the fastest form of SHR (32-bit register)
the method i just showed you does SHR on an 8-bit register
you'd have to measure, but i would guess the EDX one is slightly faster
the disadvantage is that it uses 2 dword registers

dedndave

    mov     ax,1234h   ;AX = 1234h
    mov     ah,al      ;AX = 3434h
    and     al,0Fh     ;AX = 3404h
    shr     ah,4       ;AX = 0304h

Nyatta

Say I wanted to efficiently as possible pull 2 points of 5-bit data and add it respectively from 2 8-bit variables would this be my best (visible) option:

386:
  mov ax, 0100010000110101b ;Copies value 8 and first 3 bits of value 16 to ah, final 2 bits in al plus 6 bits to remove. 2 clocks.
  shr ax, 3                 ;ah contains value 8. 3 clocks.
  shr al, 3                 ;al contains value 16. 3 clocks.
  add ah, al                ;Adds al to ah. 2 clock.
                            ;10 clocks total.


486:
  mov ax, 0100010000110101b ;Copies value 8 and first 3 bits of value 16 to ah, final 2 bits in al plus 6 bits to remove. 1 clock.
  ror ax, 3                 ;Rotates ax 3 bits right. 2 clocks.
  ror al, 3                 ;Rotates al 3 bits right. 2 clocks.
  and ax, 0001111100011111b ;Obliterates unused data. 1 clock
  add ah, al                ;Adds al to ah. 1 clock.
                            ;7 clocks total.

jj2007

What about a testbed?

7 1 2 13 10 13 3 12 0 8 9 5 9 10 4 2 9 7 11 15 10 2 13 1 3 12 14 15 15 6
--- encoding 100 Million items: ---
Encoding took 115 ms
Encoding took 112 ms
Encoding took 113 ms
Encoding took 115 ms
Encoding took 113 ms
--- decoding: ---
Decoding took 51 ms
Decoding took 52 ms
Decoding took 52 ms
Decoding took 52 ms
Decoding took 52 ms
7 1 2 13 10 13 3 12 0 8 9 5 9 10 4 2 9 7 11 15 10 2 13 1 3 12 14 15 15 6


Source:

include \masm32\MasmBasic\MasmBasic.inc      ; download
  Init
  Items=100000000        ; 100 Million nibbles
  TimeCount=5
  ShowValues=30
  Dim Src() As BYTE      ; create array for random data
  Let edi=New$(Items/2)  ; create destination buffer for packed data
  Let ebx=New$(Items)    ; create destination buffer for unpacked data
  xor ecx, ecx
  .Repeat
      void Rand(16)      ; create random values from 0...15, i.e. 4-bit nibbles
      mov Src(ecx), al
      .if ecx<ShowValues
            Print Str$(Src(ecx)), " "      ; sample output
      .endif
      inc ecx
  .Until ecx>=Items
  PrintLine Str$("\n --- encoding %i Million items: ---", Items/1000000)
  REPEAT TimeCount
      call Encode
  ENDM
  PrintLine "--- decoding: ---"
  REPEAT TimeCount
      call Decode
  ENDM
  For_ ecx=0 To ShowValues-1
      Print Str$(byte ptr [ebx+ecx]), " "      ; print sample output from unpacked data
  Next
  Inkey CrLf$, "--- hit any key ---"
  Exit
Encode proc
  NanoTimer()
  push edi
  xor ecx, ecx
  .Repeat
      movzx eax, Src(ecx)      ; load a nibble
      inc ecx                  ; get next random value
      movzx edx, Src(ecx)      ; load another nibble
      shl edx, 4
      add eax, edx
      stosb      ; store packed
      inc ecx
  .Until ecx>=Items
  pop edi
  Print Str$("Encoding took %i ms\n", NanoTimer(ms))
  ret
Encode endp

Decode proc
  NanoTimer()
  xor ecx, ecx
  .Repeat
;       lea edx, [edi+ecx]
;       int 3
      mov al, [edi+ecx]      ; load 2 nibbles
      mov dl, al
      and al, 0FH      ; get first nibble
      mov [ebx+2*ecx], al
      shr dl, 4
      mov [ebx+2*ecx+1], dl
      inc ecx
  .Until ecx>=Items/2
  Print Str$("Decoding took %i ms\n", NanoTimer(ms))
  ret
Decode endp

EndOfCode

rrr314159

#9
Hello NyattaFaux,

Glad to see you want to use assembler for your task ... to go back to your original post for a moment,

Quote from: NyattaFauxit is my hope to load and save data that is saved smaller than 1 byte

- question, do u want to handle data which might be any number of bits: 3, 4, 5 ...? Or can u settle on just one size, which would simplify the job? 4 would be most convenient

- BCD instructions handle 4-bit data. I've never used them, believe they're slower, and they don't apply to 3 or 5-bit data, so let's ignore them.

Quote from: NyattaFauxAssuming I were to need to load 2 4-bit values from 1 byte and sum them together is there a system more efficient than splitting them into two 1 byte values and then summing them (if possible)?

- With possible exception of BCD, No.

Don't do this:

.code
start:
   jmp @F
   twoVals db 10100011 ;163, but also our 10 and 3 respectively.
  @@:


instead, this:

.data
   twoVals db 10100011 ;163, but also our 10 and 3 respectively.
.code

start:


Now, 4 bits is most efficient (since 2 fit exactly in a byte) and allows these techniques:

; decode :
    movzx ax, twoVals
    shr ax, 4
    shl al, 4
; now the 2 vals are in al and ah, can be added etc

; or, decode as dedndave showed:
    movzx al, twoVals
    mov ah, al
    and al, 0Fh
    shr ah, 4

; these decodes are about the same speed

; encode (assume the two nibbles are in al and ah) :
    shl ah, 4
    or al, ah
    mov twoVals, al


However with "n" bits (3, 4, 5) it's a bit clumsier. The decode you gave for "386",


  mov ax, 0100010000110101b ;Copies value 8 and first 3 bits of value 16 to ah, final 2 bits in al plus 6 bits to remove. 2 clocks.
  shr ax, 3                 ;ah contains value 8. 3 clocks.
  shr al, 3                 ;al contains value 16. 3 clocks.
  add ah, al                ;Adds al to ah. 2 clock.
                            ;10 clocks total.


... is fine, and has the advantage that you can simply replace the "3" with "5" or even "n" (where n=3, or 5, etc), and it still works.

Finally, I would recommend putting the encode and decode in a macro, make the code a lot cleaner. For instance,

decode MACRO thetwoVals
    movzx ax, thetwoVals
    shr ax, 4
    shl al, 4
ENDM

; now you can use it like this:

decode twoVals
add al, ah
; ... etc




I am NaN ;)

Nyatta

@jj2007:
That does appear extremely helpful, just a bit above the understanding I've gathered thus far; I will definitely return to it when I've begun practicing including .inc files and utilizing them.

@rrr314159:
Alright, thank you for the tips, I will do my best to utilize them.

Would it be more efficient to pair data within the coding (running two mnemonics at once on separate registers)? I have no practice, but from my understanding this works by counting the clock rates between your last action utilizing that register? Say, you need a register while the register is in use, does it halt the coding on that line and wait for it to open? E.g.:
   mov al, value ;Add one clock to register.
   mov ah, value ;Add one clock to register simultaneous with last line.
   ror ax, 8     ;Waits 1 clock to perform due to the registers being in use.


If that is correct, I wrote this to utilize it:
(Note, in the example I provide my data is read right to left per 4 or 8 bits to maximize to observed optimization potential.)
   .data
     eightVals db 34112141h ;1, 4, 1, 2, etc...

   .code

start:

decode MACRO eightVals
   mov eax, eightVals       ;1 clock. eightVal's value is now stored in eax.
   mov edx, eightVals       ;0 additional clocks: paired. eightVal's value is now stored in edx.

   ror edx, 4               ;2 clocks. 4-bit value in edx is now aligned to the right.

   and eax, 0F0F0F0Fh       ;1 clock. Converstion of bytes successful in eax. Contains 1, 1, 1, 4.
   and edx, 0F0F0F0Fh       ;0 additional clocks: paired. Converstion of bytes successful in eax. Contains 4, 2, 1, 3.
ENDM

   decode eightVals
   add al, dl               ;1 clock. Begin respectively ordered addition. 4+1=5
   add al, ah               ;1 clock. 1+5=6
   add al, dh               ;1 clock. 2+6=8

   mov bl, al               ;1 clock. Transfer my current total out of harms way. bl=8

   ror eax, 16               ;2 clocks. Retrieving next 2 values in eax. eax reads 1, 4, 8, 1.
   ror edx, 16               ;0 additional clocks: paired. Retrieving next 2 values in edx. edx reads 1, 3, 4, 2.

   add al, bl               ;1 clock. 8+1=9
   add al, dl               ;1 clock. 1+9=A
   add al, ah               ;1 clock. 4+A=E
   add al, dh               ;1 clock. 3+E=11, stored neatly inside of the byte of al.

end start


If this is a correct use of registers and pairing, the macros don't cost me clock cycles to run, and the code functions as intended, I should have just achieved the sequential summing of 8 4-bit values in 14 clock cycles. 1.75 clocks per 4-bits.
Macros are inserted upon their call location while compiling, correct? Two calls making for 2 copies of it in the compiled program?

jj2007

Quote from: NyattaFaux on September 21, 2015, 01:02:59 PM
   and eax, 0F0F0F0Fh       ;1 clock. Converstion of bytes successful in eax. Contains 1, 1, 1, 4.
   and edx, 0F0F0F0Fh       ;0 additional clocks: paired

My timings (attached) tell a different story...
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

55      cycles for 100 * eax+edx
17      cycles for 100 * eax only
0       cycles for 100 * empty loop

55      cycles for 100 * eax+edx
17      cycles for 100 * eax only
0       cycles for 100 * empty loop

55      cycles for 100 * eax+edx
17      cycles for 100 * eax only
0       cycles for 100 * empty loop

55      cycles for 100 * eax+edx
18      cycles for 100 * eax only
0       cycles for 100 * empty loop

55      cycles for 100 * eax+edx
17      cycles for 100 * eax only
0       cycles for 100 * empty loop

21      bytes for eax+edx
10      bytes for eax only
0       bytes for empty loop


QuoteMacros are inserted upon their call location while compiling, correct? Two calls making for 2 copies of it in the compiled program?

Yes, exactly.

Nyatta

@jj2007:
I'm not sure as to what I am looking at... "cycles for 100"? How many clocks over is it and was it paired?
From the "Introduction to Assembler" document: "pairing, when mnemonics can go through the two pipelines in pairs, the code runs nominally twice as fast." My interpretation was that by performing mnemonics on multiple fitting registers I am cutting the total clocks down. Could it have been that I was using EDX for an "and" mnemonic?

@rrr314159:
My apologies, I hadn't responded to your question to which the answer is yes: the dynamacy would be nice. If I needed to make it handle an array of bit sizes I'd write each bit-size's system separately in order to optimise it to it's fullest... Unless of course that is actually putting a higher strain on the processor.

Mini-rant:
I will sit and work on something small for the longest period of time just to ensure I am accomplishing it with the least number of clocks possible (as it is very important to me and to lengthy looping systems). What I am doing now, practicing optimized machine code to push my dreams into reality, is really amazing so far because ever since I got the idea in my head to do this I've had not the slightest understanding as to what I was reading ... in part because I was too eager, too excited, not aware enough of how these systems work, and not patient enough to actually find the words I needed to read and read them ... REGARDLESS, years later I am really motivated and feel ready to be learning this now, I'm going to be putting time into forming an understanding of optimal systems to work by, and I'll do my best to keep any silly questions to a minimum and utilize all of the current learning resources. Point being: I'm trying very hard not to be obnoxious and plan to be around. :redface:
I have yet to finish the "Introduction to Assembler" document let alone any other resource, I am working on getting that down at a careful speed.

jj2007

Quote from: NyattaFaux on September 21, 2015, 06:09:29 PM
@jj2007:
I'm not sure as to what I am looking at... "cycles for 100"? How many clocks over is it and was it paired?

See lines 46ff and 67ff attached above for "was it paired".
Re clocks, yes these numbers are for one-hundred loop iterations. So that makes it 0.16 cycles per iteration, e.g. for the eax only version:

  mov ebx, AlgoLoops-1      ; loop e.g. 100x
  align 4
  .Repeat
       mov eax, eightVals       ;1 clock. eightVal's value is now stored in eax.
;        mov edx, eightVals       ;0 additional clocks: paired. eightVal's value is now stored in edx.
     
;        ror edx, 4               ;2 clocks. 4-bit value in edx is now aligned to the right.
     
       and eax, 0F0F0F0Fh       ;1 clock. Converstion of bytes successful in eax. Contains 1, 1, 1, 4.
;        and edx, 0F0F0F0Fh       ;0 additional clocks: paired. Converstion of bytes successful in eax. Contains 4, 2, 1, 3.
      dec ebx
  .Until Sign?


Where did you get these "1 clock" etc?

Nyatta

Quote from: jj2007 on September 21, 2015, 06:18:07 PM
Where did you get these "1 clock" etc?

I had found them in MASM32 under Help>>Opcodes Help. It brings up the Intel Opcodes and Mnemonics window. For "AND" I read that "register, immediate" on a .486 is 1 clock cycle. I'm not sure I understand how I could have created something that clocks less than 1 cycle: it was my understanding 1 clock is the minimum?