News:

Masm32 SDK description, downloads and other helpful links
Message to All Guests

Main Menu

Bits Scaling to 2

Started by Antariy, January 06, 2013, 12:25:19 PM

Previous topic - Next topic

Antariy

Test for some variations of the proc to scale bits twice (double "width"). I.e., for instance, if we have the binary number: 1010Y then after processing it will be 11001100Y

Interesting to see how it will perform on different machines. The test processing full width - 16 bits to 32 bits, i.e. 16 iterations through the loop.


Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
171     cycles for 1
205     cycles for 2 (dec ecx)
173     cycles for 3 (lea edi)
174     cycles for 4 (sbb edx)
169     cycles for 1
202     cycles for 2 (dec ecx)
175     cycles for 3 (lea edi)
176     cycles for 4 (sbb edx)
170     cycles for 1
203     cycles for 2 (dec ecx)
174     cycles for 3 (lea edi)
175     cycles for 4 (sbb edx)

--- ok ---



Also, please note on how the testbed is implemented: this is a kind of a suggestion. Each tested code piece was put in different section of the code, i.e., it starts at new page boundary (similar to align 1000h) and this should help to eliminate the influence of the code layout and placement.
Commented out buffer allocation and walking through all the bytes of the buffer at the start of the test - just to be used in the tests which require data processing.

dedndave

prescott w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
165     cycles for 1
197     cycles for 2 (dec ecx)
170     cycles for 3 (lea edi)
168     cycles for 4 (sbb edx)
163     cycles for 1
191     cycles for 2 (dec ecx)
168     cycles for 3 (lea edi)
168     cycles for 4 (sbb edx)
166     cycles for 1
199     cycles for 2 (dec ecx)
167     cycles for 3 (lea edi)
169     cycles for 4 (sbb edx)

Antariy

Hi, Dave :t

It is interesting that the second variation of the code only differs by counter decrementation instruction (ADD ECX,-1 / DEC ECX), and overall it slower ~2 cycles per iteration.

sinsi

AMD Phenom(tm) II X6 1100T Processor (SSE3)
88      cycles for 1
77      cycles for 2 (dec ecx)
87      cycles for 3 (lea edi)
89      cycles for 4 (sbb edx)
86      cycles for 1
76      cycles for 2 (dec ecx)
86      cycles for 3 (lea edi)
89      cycles for 4 (sbb edx)
87      cycles for 1
76      cycles for 2 (dec ecx)
92      cycles for 3 (lea edi)
88      cycles for 4 (sbb edx)

Antariy

Hi, sinsi :t

But AMD favors the things that Intel dislike, as usually :biggrin:

dedndave

search the old forum for the words "stir fry"   :biggrin:
there was a routine in there for reversing the bits in a register
the method used may be applied, here
very cool algo

dedndave

here you go, Alex
if you can understand how that algorithm works, i believe you can apply it to what you are doing   :t

http://www.masmforum.com/board/index.php?topic=12722.msg98224#msg98224

Antariy

No, that's not reversion of the bits, but just making them "twice wide" (i.e. 0101 becomes 00110011, 1101 becomes 11110011 etc).

But it is interesting to see that algo, thank you, Dave :t

dedndave

i understand that
still, i think the algo could be modified to "double" the bits

sinsi

What number of bits are we looking at? Maximum of 16 I guess but any number? Like 3 bits or 11?

Gunther

Hi Alex,

here is my result:


Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
102 cycles for 1
95 cycles for 2 (dec ecx)
130 cycles for 3 (lea edi)
100 cycles for 4 (sbb edx)
129 cycles for 1
100 cycles for 2 (dec ecx)
130 cycles for 3 (lea edi)
129 cycles for 4 (sbb edx)
128 cycles for 1
95 cycles for 2 (dec ecx)
100 cycles for 3 (lea edi)
128 cycles for 4 (sbb edx)

--- ok ---


Gunther
You have to know the facts before you can distort them.

FORTRANS


pre-P4 (SSE1)
129     cycles for 1
126     cycles for 2 (dec ecx)
137     cycles for 3 (lea edi)
128     cycles for 4 (sbb edx)
123     cycles for 1
124     cycles for 2 (dec ecx)
134     cycles for 3 (lea edi)
125     cycles for 4 (sbb edx)
122     cycles for 1
124     cycles for 2 (dec ecx)
134     cycles for 3 (lea edi)
125     cycles for 4 (sbb edx)

--- ok ---

six_L

QuoteIntel(R) Core(TM) i3 CPU       M 370  @ 2.40GHz (SSE4)
76   cycles for 1
77   cycles for 2 (dec ecx)
75   cycles for 3 (lea edi)
75   cycles for 4 (sbb edx)
74   cycles for 1
76   cycles for 2 (dec ecx)
76   cycles for 3 (lea edi)
76   cycles for 4 (sbb edx)
74   cycles for 1
73   cycles for 2 (dec ecx)
75   cycles for 3 (lea edi)
75   cycles for 4 (sbb edx)

--- ok ---

Say you, Say me, Say the codes together for ever.

dedndave

sinsi - my understanding is that a word gets "doubled" into a dword
some kind of exponentiation, actually

i can't think of when or why, but i have had need to do something like this in the past - lol
i think it can be done much faster, though   :P
a 256-byte LUT would be some improvement
but, i think the "Sexy Stir-Fry" technique could be employed
similar to how this bit-reversing algo works
;------------------------------------------------------------

        OPTION  PROLOGUE:None
        OPTION  EPILOGUE:None

BitReverse PROC dwVal:DWORD

;Dword Bit-Swap Algorithm
;Sexy Stir-Fry Technique
;
;The original concept was developed by BitRake and Nexo.
;This version was updated by Drizz and Lingo.

        pop     ecx
        mov     edx,0F0F0F0Fh
        pop     eax
        and     edx,eax
        and     eax,0F0F0F0F0h
        shl     edx,4
        shr     eax,4
        or      eax,edx
        mov     edx,33333333h
        and     edx,eax
        and     eax,0CCCCCCCCh
        shr     eax,2
        lea     eax,[edx*4+eax]
        mov     edx,55555555h
        and     edx,eax
        and     eax,0AAAAAAAAh
        shr     eax,1
        lea     eax,[edx*2+eax]
        bswap   eax
        jmp     ecx

BitReverse ENDP

        OPTION  PROLOGUE:PrologueDef
        OPTION  EPILOGUE:EpilogueDef

;------------------------------------------------------------

Antariy

Hi, Dave :biggrin:

Well, the bits swapping algo based on shifts to shuffle the bits, but it cannot be applied here. The only "similar" part is that both algos use shifting instructions (SHR/SHL/LEA) to get job done, but here we need literally to scale bit map, not to reverse it, i.e., when you reversing bits you are working with fixed pattern (map) of bits and "just" shifting and combining it, in scaling here you should duplicate the bit map, but with higher "resolution" - probably that's most suitable description. Like if you resizing 1bpp bitmap :biggrin:

Hi, sinsi, yes, any 16 bit (or less) number, which will be scaled (up to 32 bits for 16 bit number).