Bits Scaling to 2

dedndave · January 07, 2013, 03:49:06 AM

well - certainly, an LUT would be much faster
4 look-ups into a 256-byte table should be about 20-25 clock cycles or something :P

i still think the sexy stir-fry would work, though - lol

jj2007 · January 07, 2013, 04:12:43 AM

Hi Alex,
Just out of curiosity: Is there a real life app behind?
And what is the game's rule? 8 bit becomes 16 bit, 16->32?

1010 -> 11001100
0101 -> 00110011
10101010 -> 1100110011001100

As Dave already wrote, a 256-word LUT would be ideal...

dedndave · January 07, 2013, 04:18:27 AM

yah - that's probably the easiest way to make it fast
you don't have to carry the table in initialized data, either
make a little routine to build the table in uninitialized data at start-up :t

Gunther · January 07, 2013, 04:27:08 AM

Hi Dave,

Quote from: dedndave on January 07, 2013, 04:18:27 AM
you don't have to carry the table in initialized data, either
make a little routine to build the table in uninitialized data at start-up :t

good idea. Such a thing is called initial calculation (computation). :t

Gunther

dedndave · January 07, 2013, 04:29:37 AM

we just call in an "init routine"

Jochen did something like that for his float-to-ascii thingy

"thingy" = highly technical term
if i tell you about it, i have to kill ya

Gunther · January 07, 2013, 04:33:06 AM

Hi Dave,

"init" or "initial", there's not much difference at all.

Quote from: dedndave on January 07, 2013, 04:29:37 AM
Jochen did something like that for his float-to-ascii thingy

"thingy" = highly technical term
if i tell you about it, i have to kill ya

I know, I know. :lol:

Gunther

dedndave · January 07, 2013, 04:45:54 AM

doh !

it wouldn't be a 256-byte table - lol
it would be 4 accesses on a 16-byte table - put it in initialized data :t

you could go the next step up and make 2 accesses on a 256-word table
that one, you could make in unitialized data and use an "initial calculation (computation)" routine

jj2007 · January 07, 2013, 05:02:46 AM

Here is it:

LUT:
dw 0000000000000000b, 0000000000000011b, 0000000000001100b, 0000000000001111b
dw 0000000000110000b, 0000000000110011b, 0000000000111100b, 0000000000111111b
...
dw 1111111111000000b, 1111111111000011b, 1111111111001100b, 1111111111001111b
dw 1111111111110000b, 1111111111110011b, 1111111111111100b, 1111111111111111b

BuildLUT:
mov edi, offset LUT
xor ecx, ecx
inc ch
or edx, -1
.Repeat
inc edx
xor eax, eax
push 8
.Repeat
shl eax, 2
rol dl, 1
.if Carry?
or eax, 3 ; set 11
.endif
dec dword ptr [esp]
.Until Zero?
stosw
pop eax
dec ecx
.Until Zero?
retn

Gunther · January 07, 2013, 05:50:15 AM

Good job, Jochen. :t Thank you. I think that Alex will enjoy that.

Gunther

dedndave · January 07, 2013, 06:17:34 AM

i see what you mean, Alex
there is probably some mathematical way to do it
a good problem for the guys at The Euler Project :P

Code Select

waBdLut dw 256 dup(?)

InitBdLut PROC

        mov     edx,255
        push    ebx
        push    edi
        mov     ebx,edx
        mov     edi,offset waBdLut+508

IBLut0: mov     ecx,8

IBLut1: ror     edx,1
        rcr     eax,1
        shl     edx,1
        rcr     eax,1
        shr     edx,1
        dec     ecx
        jnz     IBLut1

        mov     [edi],eax
        sub     edi,2
        dec     ebx
        mov     edx,ebx
        jnz     IBLut0

        mov     [edi+2],bl
        pop     edi
        pop     ebx
        ret

InitBdLut ENDP

Code Select

0000 0003 000C 000F FFF0 FFF3 FFFC FFFF
48 bytes of code
30047 30030 30011 29985 29998

Antariy · January 07, 2013, 09:28:38 AM

Thanks for all the tests!

Quote from: jj2007 on January 07, 2013, 04:12:43 AM
Hi Alex,
Just out of curiosity: Is there a real life app behind?
And what is the game's rule? 8 bit becomes 16 bit, 16->32?

1010 -> 11001100
0101 -> 00110011
10101010 -> 1100110011001100

As Dave already wrote, a 256-word LUT would be ideal...

Hi Jochen, yes, there should be more code behind. This proc should be just collaborative proc to help to build LUT to use by the other algo

So, the speed isn't a concern here (but it is interesting), but size is preferably. In the test as you see I have used the same algo in every test, just moved couple of instructions to see how it will perform. The test showed that SBB doesn't slows the code down much, DEC has different behaviours on different CPUs generations and models, as usual, and the timings are pretty stable - probably the method used (different sections of the code for every tested code) suited its task here.

Yes, the rule is every 1 bit becomes 2 bits.

Practically, this algo intents to build WORD-indexed LUT from BYTE-indexed LUT to avoid any computations (WORD->BYTE) in real time in the algo which will use that WORD index.

Something like this:

XMM0=0012 0034 0056 0078 009A 00BC 00DE 00F0
XMM1=0012 0000 0056 0000 0000 00BC 00DE 0000

PCMPEQW XMM0,XMM1

; XMM0 = FFFF 0000 FFFF 0000 0000 FFFF FFFF 0000

PMOVMSKB EDX,XMM0

; EDX = 11 00 11 00 00 11 11 00

; [EDX+offset LUT] - some data corresponding to a pattern of the WORDs layout in the XMMs.

As you see, here we have doubled (word comparsion) resolution, and instead of using "resizing" it to one-bit-per-word and accessing 256 bytes LUT (8 bit index), it is faster to use 16 bit index LUT, which need to build from 256 byte LUT with the help of the the topic algo (to convert byte index to word index while filling the 16 bit indexed table).
Something like:

; EBX is a byte index
@@:
invoke ScaleBits,ebx,8
movzx edx,byte ptr [ebx+Lut256]
mov [eax+Lut64k],dl
inc ebx
loop @B

Hi Dave, I understand your point about speeding it up with a LUT, but in this example it is better to use mathematical way, because it then is a usage of the one algo to build the LUT of the other algo that builds the table of the next other algo

dedndave · January 07, 2013, 10:05:56 AM

i get 15 clock cycles on my p4 prescott w/htt :P

Code Select

BitDbl  PROC    dwVal:DWORD

        movzx   eax,byte ptr dwVal+1
        mov     ecx,offset waBdLut
        dec     eax
        movzx   edx,byte ptr dwVal
        mov     eax,[ecx+2*eax]
        mov     ax,[ecx+2*edx]
        ret

BitDbl  ENDP

dedndave · January 07, 2013, 10:27:58 AM

i get 9 cycles for this one...

Code Select

        OPTION  PROLOGUE:None
        OPTION  EPILOGUE:None

BitDbl  PROC    dwVal:DWORD

        movzx   eax,byte ptr [esp+5]
        mov     ecx,offset waBdLut
        dec     eax
        movzx   edx,byte ptr [esp+4]
        mov     eax,[ecx+2*eax]
        mov     ax,[ecx+2*edx]
        ret     4

BitDbl  ENDP

        OPTION  PROLOGUE:PrologueDef
        OPTION  EPILOGUE:EpilogueDef

frktons · January 07, 2013, 11:00:02 AM

Quote
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz (SSE4)
116 cycles for 1
119 cycles for 2 (dec ecx)
119 cycles for 3 (lea edi)
123 cycles for 4 (sbb edx)
116 cycles for 1
120 cycles for 2 (dec ecx)
119 cycles for 3 (lea edi)
122 cycles for 4 (sbb edx)
116 cycles for 1
120 cycles for 2 (dec ecx)
119 cycles for 3 (lea edi)
133 cycles for 4 (sbb edx)

My tests for Alex routine.

Antariy · January 07, 2013, 11:09:53 AM

Very cool, Dave :t
But still in the case using of the algo to build a table to be used with a collaborative (used once at the time of the process initialization) proc to build a table to be used with a "main" algo, is a "bit too-too (too much things)"

Also, the scalability of the scalling (pun) proc would be a good thing in the case, i.e. if we wanted to make scaling optional - 2, 4 or 8 times - for your algo that means building of a 3 different tables. But it is really fast (9...11 clocks on my CPU) :icon14:

The MASM Forum

News:

Bits Scaling to 2

dedndave

jj2007

dedndave

Gunther

dedndave

Gunther

dedndave

jj2007

Gunther

dedndave

Antariy

dedndave

dedndave

frktons

Antariy