News:

Masm32 SDK description, downloads and other helpful links
Message to All Guests
NB: Posting URL's See here: Posted URL Change

Main Menu

Bits Scaling to 2

Started by Antariy, January 06, 2013, 12:25:19 PM

Previous topic - Next topic

dedndave

well - certainly, an LUT would be much faster
4 look-ups into a 256-byte table should be about 20-25 clock cycles or something   :P

i still think the sexy stir-fry would work, though - lol

jj2007

Hi Alex,
Just out of curiosity: Is there a real life app behind?
And what is the game's rule? 8 bit becomes 16 bit, 16->32?

1010 -> 11001100
0101 -> 00110011
10101010 -> 1100110011001100

As Dave already wrote, a 256-word LUT would be ideal...

dedndave

yah - that's probably the easiest way to make it fast
you don't have to carry the table in initialized data, either
make a little routine to build the table in uninitialized data at start-up   :t

Gunther

Hi Dave,

Quote from: dedndave on January 07, 2013, 04:18:27 AM
you don't have to carry the table in initialized data, either
make a little routine to build the table in uninitialized data at start-up   :t

good idea. Such a thing is called initial calculation (computation).  :t

Gunther
You have to know the facts before you can distort them.

dedndave

we just call in an "init routine"   :biggrin:

Jochen did something like that for his float-to-ascii thingy

"thingy" = highly technical term
if i tell you about it, i have to kill ya

Gunther

Hi Dave,

"init" or "initial", there's not much difference at all.

Quote from: dedndave on January 07, 2013, 04:29:37 AM
Jochen did something like that for his float-to-ascii thingy

"thingy" = highly technical term
if i tell you about it, i have to kill ya

I know, I know.  :lol:

Gunther
You have to know the facts before you can distort them.

dedndave

doh !

it wouldn't be a 256-byte table - lol
it would be 4 accesses on a 16-byte table - put it in initialized data   :t

you could go the next step up and make 2 accesses on a 256-word table
that one, you could make in unitialized data and use an "initial calculation (computation)" routine   :biggrin:

jj2007

Here is it:

LUT:
dw 0000000000000000b, 0000000000000011b, 0000000000001100b, 0000000000001111b
dw 0000000000110000b, 0000000000110011b, 0000000000111100b, 0000000000111111b
...
dw 1111111111000000b, 1111111111000011b, 1111111111001100b, 1111111111001111b
dw 1111111111110000b, 1111111111110011b, 1111111111111100b, 1111111111111111b


BuildLUT:
  mov edi, offset LUT
  xor ecx, ecx
  inc ch
  or edx, -1
  .Repeat
        inc edx
        xor eax, eax
        push 8
        .Repeat
                shl eax, 2
                rol dl, 1
                .if Carry?
                        or eax, 3        ; set 11
                .endif
                dec dword ptr [esp]
        .Until Zero?
        stosw
        pop eax
        dec ecx
  .Until Zero?
  retn

Gunther

Good job, Jochen.  :t Thank you. I think that Alex will enjoy that.

Gunther
You have to know the facts before you can distort them.

dedndave

i see what you mean, Alex
there is probably some mathematical way to do it
a good problem for the guys at The Euler Project   :P

waBdLut dw 256 dup(?)

InitBdLut PROC

        mov     edx,255
        push    ebx
        push    edi
        mov     ebx,edx
        mov     edi,offset waBdLut+508

IBLut0: mov     ecx,8

IBLut1: ror     edx,1
        rcr     eax,1
        shl     edx,1
        rcr     eax,1
        shr     edx,1
        dec     ecx
        jnz     IBLut1

        mov     [edi],eax
        sub     edi,2
        dec     ebx
        mov     edx,ebx
        jnz     IBLut0

        mov     [edi+2],bl
        pop     edi
        pop     ebx
        ret

InitBdLut ENDP


0000 0003 000C 000F FFF0 FFF3 FFFC FFFF
48 bytes of code
30047 30030 30011 29985 29998


Antariy

Thanks for all the tests! :biggrin:

Quote from: jj2007 on January 07, 2013, 04:12:43 AM
Hi Alex,
Just out of curiosity: Is there a real life app behind?
And what is the game's rule? 8 bit becomes 16 bit, 16->32?

1010 -> 11001100
0101 -> 00110011
10101010 -> 1100110011001100

As Dave already wrote, a 256-word LUT would be ideal...

Hi Jochen, yes, there should be more code behind. This proc should be just collaborative proc to help to build LUT to use by the other algo :biggrin: So, the speed isn't a concern here (but it is interesting), but size is preferably. In the test as you see I have used the same algo in every test, just moved couple of instructions to see how it will perform. The test showed that SBB doesn't slows the code down much, DEC has different behaviours on different CPUs generations and models, as usual, and the timings are pretty stable - probably the method used (different sections of the code for every tested code) suited its task here.

Yes, the rule is every 1 bit becomes 2 bits.

Practically, this algo intents to build WORD-indexed LUT from BYTE-indexed LUT to avoid any computations (WORD->BYTE) in real time in the algo which will use that WORD index.

Something like this:

XMM0=0012 0034 0056 0078 009A 00BC 00DE 00F0
XMM1=0012 0000 0056 0000 0000 00BC 00DE 0000

PCMPEQW XMM0,XMM1

; XMM0 =  FFFF 0000 FFFF 0000 0000 FFFF FFFF 0000

PMOVMSKB EDX,XMM0

; EDX = 11 00 11 00 00 11 11 00

; [EDX+offset LUT] - some data corresponding to a pattern of the WORDs layout in the XMMs.


As you see, here we have doubled (word comparsion) resolution, and instead of using "resizing" it to one-bit-per-word and accessing 256 bytes LUT (8 bit index), it is faster to use 16 bit index LUT, which need to build from 256 byte LUT with the help of the the topic algo (to convert byte index to word index while filling the 16 bit indexed table).
Something like:

; EBX is a byte index
@@:
invoke ScaleBits,ebx,8
movzx edx,byte ptr [ebx+Lut256]
mov [eax+Lut64k],dl
inc ebx
loop @B



Hi Dave, I understand your point about speeding it up with a LUT, but in this example it is better to use mathematical way, because it then is a usage of the one algo to build the LUT of the other algo that builds the table of the next other algo :biggrin:

dedndave

i get 15 clock cycles on my p4 prescott w/htt   :P

BitDbl  PROC    dwVal:DWORD

        movzx   eax,byte ptr dwVal+1
        mov     ecx,offset waBdLut
        dec     eax
        movzx   edx,byte ptr dwVal
        mov     eax,[ecx+2*eax]
        mov     ax,[ecx+2*edx]
        ret

BitDbl  ENDP

dedndave

i get 9 cycles for this one...

        OPTION  PROLOGUE:None
        OPTION  EPILOGUE:None

BitDbl  PROC    dwVal:DWORD

        movzx   eax,byte ptr [esp+5]
        mov     ecx,offset waBdLut
        dec     eax
        movzx   edx,byte ptr [esp+4]
        mov     eax,[ecx+2*eax]
        mov     ax,[ecx+2*edx]
        ret     4

BitDbl  ENDP

        OPTION  PROLOGUE:PrologueDef
        OPTION  EPILOGUE:EpilogueDef

frktons

Quote
Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz (SSE4)
116     cycles for 1
119     cycles for 2 (dec ecx)
119     cycles for 3 (lea edi)
123     cycles for 4 (sbb edx)
116     cycles for 1
120     cycles for 2 (dec ecx)
119     cycles for 3 (lea edi)
122     cycles for 4 (sbb edx)
116     cycles for 1
120     cycles for 2 (dec ecx)
119     cycles for 3 (lea edi)
133     cycles for 4 (sbb edx)

My tests for Alex routine.
There are only two days a year when you can't do anything: one is called yesterday, the other is called tomorrow, so today is the right day to love, believe, do and, above all, live.

Dalai Lama

Antariy

Very cool, Dave :t
But still in the case using of the algo to build a table to be used with a collaborative (used once at the time of the process initialization) proc to build a table to be used with a "main" algo, is a "bit too-too (too much things)" :biggrin:
Also, the scalability of the scalling (pun) proc would be a good thing in the case, i.e. if we wanted to make scaling optional - 2, 4 or 8 times - for your algo that means building of a 3 different tables. But it is really fast (9...11 clocks on my CPU) :icon14: