Bits Scaling to 2

Antariy · January 06, 2013, 12:25:19 PM

Test for some variations of the proc to scale bits twice (double "width"). I.e., for instance, if we have the binary number: 1010Y then after processing it will be 11001100Y

Interesting to see how it will perform on different machines. The test processing full width - 16 bits to 32 bits, i.e. 16 iterations through the loop.

Code Select


Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
171     cycles for 1
205     cycles for 2 (dec ecx)
173     cycles for 3 (lea edi)
174     cycles for 4 (sbb edx)
169     cycles for 1
202     cycles for 2 (dec ecx)
175     cycles for 3 (lea edi)
176     cycles for 4 (sbb edx)
170     cycles for 1
203     cycles for 2 (dec ecx)
174     cycles for 3 (lea edi)
175     cycles for 4 (sbb edx)

--- ok ---

Also, please note on how the testbed is implemented: this is a kind of a suggestion. Each tested code piece was put in different section of the code, i.e., it starts at new page boundary (similar to align 1000h) and this should help to eliminate the influence of the code layout and placement.
Commented out buffer allocation and walking through all the bytes of the buffer at the start of the test - just to be used in the tests which require data processing.

dedndave · January 06, 2013, 12:56:01 PM

prescott w/htt

Code Select

Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
165     cycles for 1
197     cycles for 2 (dec ecx)
170     cycles for 3 (lea edi)
168     cycles for 4 (sbb edx)
163     cycles for 1
191     cycles for 2 (dec ecx)
168     cycles for 3 (lea edi)
168     cycles for 4 (sbb edx)
166     cycles for 1
199     cycles for 2 (dec ecx)
167     cycles for 3 (lea edi)
169     cycles for 4 (sbb edx)

Antariy · January 06, 2013, 01:09:54 PM

Hi, Dave :t

It is interesting that the second variation of the code only differs by counter decrementation instruction (ADD ECX,-1 / DEC ECX), and overall it slower ~2 cycles per iteration.

sinsi · January 06, 2013, 01:26:12 PM

AMD Phenom(tm) II X6 1100T Processor (SSE3)
88 cycles for 1
77 cycles for 2 (dec ecx)
87 cycles for 3 (lea edi)
89 cycles for 4 (sbb edx)
86 cycles for 1
76 cycles for 2 (dec ecx)
86 cycles for 3 (lea edi)
89 cycles for 4 (sbb edx)
87 cycles for 1
76 cycles for 2 (dec ecx)
92 cycles for 3 (lea edi)
88 cycles for 4 (sbb edx)

Antariy · January 06, 2013, 01:51:05 PM

Hi, sinsi :t

But AMD favors the things that Intel dislike, as usually

dedndave · January 06, 2013, 01:58:18 PM

search the old forum for the words "stir fry"

there was a routine in there for reversing the bits in a register
the method used may be applied, here
very cool algo

dedndave · January 06, 2013, 02:07:04 PM

here you go, Alex
if you can understand how that algorithm works, i believe you can apply it to what you are doing :t

http://www.masmforum.com/board/index.php?topic=12722.msg98224#msg98224

Antariy · January 06, 2013, 02:08:59 PM

No, that's not reversion of the bits, but just making them "twice wide" (i.e. 0101 becomes 00110011, 1101 becomes 11110011 etc).

But it is interesting to see that algo, thank you, Dave :t

dedndave · January 06, 2013, 02:24:54 PM

i understand that
still, i think the algo could be modified to "double" the bits

sinsi · January 06, 2013, 06:35:09 PM

What number of bits are we looking at? Maximum of 16 I guess but any number? Like 3 bits or 11?

Gunther · January 06, 2013, 11:01:50 PM

Hi Alex,

here is my result:

Code Select


Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
102	cycles for 1
95	cycles for 2 (dec ecx)
130	cycles for 3 (lea edi)
100	cycles for 4 (sbb edx)
129	cycles for 1
100	cycles for 2 (dec ecx)
130	cycles for 3 (lea edi)
129	cycles for 4 (sbb edx)
128	cycles for 1
95	cycles for 2 (dec ecx)
100	cycles for 3 (lea edi)
128	cycles for 4 (sbb edx)

--- ok ---

Gunther

FORTRANS · January 07, 2013, 12:36:21 AM

pre-P4 (SSE1)
129 cycles for 1
126 cycles for 2 (dec ecx)
137 cycles for 3 (lea edi)
128 cycles for 4 (sbb edx)
123 cycles for 1
124 cycles for 2 (dec ecx)
134 cycles for 3 (lea edi)
125 cycles for 4 (sbb edx)
122 cycles for 1
124 cycles for 2 (dec ecx)
134 cycles for 3 (lea edi)
125 cycles for 4 (sbb edx)

--- ok ---

six_L · January 07, 2013, 12:45:21 AM

QuoteIntel(R) Core(TM) i3 CPU M 370 @ 2.40GHz (SSE4)
76   cycles for 1
77   cycles for 2 (dec ecx)
75   cycles for 3 (lea edi)
75   cycles for 4 (sbb edx)
74   cycles for 1
76   cycles for 2 (dec ecx)
76   cycles for 3 (lea edi)
76   cycles for 4 (sbb edx)
74   cycles for 1
73   cycles for 2 (dec ecx)
75   cycles for 3 (lea edi)
75   cycles for 4 (sbb edx)

--- ok ---

dedndave · January 07, 2013, 01:26:02 AM

sinsi - my understanding is that a word gets "doubled" into a dword
some kind of exponentiation, actually

i can't think of when or why, but i have had need to do something like this in the past - lol
i think it can be done much faster, though :P
a 256-byte LUT would be some improvement
but, i think the "Sexy Stir-Fry" technique could be employed
similar to how this bit-reversing algo works

Code Select

;------------------------------------------------------------

        OPTION  PROLOGUE:None
        OPTION  EPILOGUE:None

BitReverse PROC dwVal:DWORD

;Dword Bit-Swap Algorithm
;Sexy Stir-Fry Technique
;
;The original concept was developed by BitRake and Nexo.
;This version was updated by Drizz and Lingo.

        pop     ecx
        mov     edx,0F0F0F0Fh
        pop     eax
        and     edx,eax
        and     eax,0F0F0F0F0h
        shl     edx,4
        shr     eax,4
        or      eax,edx
        mov     edx,33333333h
        and     edx,eax
        and     eax,0CCCCCCCCh
        shr     eax,2
        lea     eax,[edx*4+eax]
        mov     edx,55555555h
        and     edx,eax
        and     eax,0AAAAAAAAh
        shr     eax,1
        lea     eax,[edx*2+eax]
        bswap   eax
        jmp     ecx

BitReverse ENDP

        OPTION  PROLOGUE:PrologueDef
        OPTION  EPILOGUE:EpilogueDef

;------------------------------------------------------------

Antariy · January 07, 2013, 03:22:41 AM

Hi, Dave

Well, the bits swapping algo based on shifts to shuffle the bits, but it cannot be applied here. The only "similar" part is that both algos use shifting instructions (SHR/SHL/LEA) to get job done, but here we need literally to scale bit map, not to reverse it, i.e., when you reversing bits you are working with fixed pattern (map) of bits and "just" shifting and combining it, in scaling here you should duplicate the bit map, but with higher "resolution" - probably that's most suitable description. Like if you resizing 1bpp bitmap

Hi, sinsi, yes, any 16 bit (or less) number, which will be scaled (up to 32 bits for 16 bit number).

The MASM Forum

News:

Bits Scaling to 2

Antariy

dedndave

Antariy

sinsi

Antariy

dedndave

dedndave

Antariy

dedndave

sinsi

Gunther

FORTRANS

six_L

dedndave

Antariy