The MASM Forum

General => The Laboratory => Topic started by: Antariy on January 06, 2013, 12:25:19 PM

Title: Bits Scaling to 2
Post by: Antariy on January 06, 2013, 12:25:19 PM
A test of several variations of a proc that scales bits by two (doubles their "width"). For instance, if we have the binary number 1010Y, then after processing it becomes 11001100Y.

It would be interesting to see how it performs on different machines. The test processes the full width, 16 bits to 32 bits, i.e. 16 iterations through the loop.
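For reference, the operation being timed can be modelled in C. This is only a sketch of the semantics, not the tested MASM code; the name `scale_bits2` and its parameters are made up for illustration.

```c
#include <stdint.h>

/* Reference model (hypothetical helper, not the tested proc):
   bit i of the input becomes bits 2i and 2i+1 of the result.
   Only the low `bits` bits are processed, at most 16. */
static uint32_t scale_bits2(uint32_t value, unsigned bits)
{
    uint32_t result = 0;
    for (unsigned i = 0; i < bits && i < 16; ++i) {
        if (value & (1u << i))
            result |= 3u << (2 * i);   /* a set bit becomes the pair 11 */
    }
    return result;
}
```

With this model, `scale_bits2(0xA, 4)` gives 0xCC, i.e. 1010 becomes 11001100.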


Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
171     cycles for 1
205     cycles for 2 (dec ecx)
173     cycles for 3 (lea edi)
174     cycles for 4 (sbb edx)
169     cycles for 1
202     cycles for 2 (dec ecx)
175     cycles for 3 (lea edi)
176     cycles for 4 (sbb edx)
170     cycles for 1
203     cycles for 2 (dec ecx)
174     cycles for 3 (lea edi)
175     cycles for 4 (sbb edx)

--- ok ---



Also, please note how the testbed is implemented; this is a kind of suggestion. Each tested code piece is placed in its own code section, i.e. it starts at a new page boundary (similar to align 1000h), which should help to eliminate the influence of code layout and placement.
The buffer allocation and the walk through all the bytes of the buffer at the start of the test are commented out; they are there only for tests which require data processing.
Title: Re: Bits Scaling to 2
Post by: dedndave on January 06, 2013, 12:56:01 PM
prescott w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
165     cycles for 1
197     cycles for 2 (dec ecx)
170     cycles for 3 (lea edi)
168     cycles for 4 (sbb edx)
163     cycles for 1
191     cycles for 2 (dec ecx)
168     cycles for 3 (lea edi)
168     cycles for 4 (sbb edx)
166     cycles for 1
199     cycles for 2 (dec ecx)
167     cycles for 3 (lea edi)
169     cycles for 4 (sbb edx)
Title: Re: Bits Scaling to 2
Post by: Antariy on January 06, 2013, 01:09:54 PM
Hi, Dave :t

It is interesting that the second variation of the code differs only in the counter decrement instruction (ADD ECX,-1 / DEC ECX), and overall it is ~2 cycles slower per iteration.
Title: Re: Bits Scaling to 2
Post by: sinsi on January 06, 2013, 01:26:12 PM
AMD Phenom(tm) II X6 1100T Processor (SSE3)
88      cycles for 1
77      cycles for 2 (dec ecx)
87      cycles for 3 (lea edi)
89      cycles for 4 (sbb edx)
86      cycles for 1
76      cycles for 2 (dec ecx)
86      cycles for 3 (lea edi)
89      cycles for 4 (sbb edx)
87      cycles for 1
76      cycles for 2 (dec ecx)
92      cycles for 3 (lea edi)
88      cycles for 4 (sbb edx)
Title: Re: Bits Scaling to 2
Post by: Antariy on January 06, 2013, 01:51:05 PM
Hi, sinsi :t

But AMD favors the things that Intel dislikes, as usual :biggrin:
Title: Re: Bits Scaling to 2
Post by: dedndave on January 06, 2013, 01:58:18 PM
search the old forum for the words "stir fry"   :biggrin:
there was a routine in there for reversing the bits in a register
the method used may be applied, here
very cool algo
Title: Re: Bits Scaling to 2
Post by: dedndave on January 06, 2013, 02:07:04 PM
here you go, Alex
if you can understand how that algorithm works, i believe you can apply it to what you are doing   :t

http://www.masmforum.com/board/index.php?topic=12722.msg98224#msg98224
Title: Re: Bits Scaling to 2
Post by: Antariy on January 06, 2013, 02:08:59 PM
No, this is not reversal of the bits, just making them "twice as wide" (i.e. 0101 becomes 00110011, 1101 becomes 11110011 etc).

But it is interesting to see that algo, thank you, Dave :t
Title: Re: Bits Scaling to 2
Post by: dedndave on January 06, 2013, 02:24:54 PM
i understand that
still, i think the algo could be modified to "double" the bits
Title: Re: Bits Scaling to 2
Post by: sinsi on January 06, 2013, 06:35:09 PM
What number of bits are we looking at? Maximum of 16 I guess but any number? Like 3 bits or 11?
Title: Re: Bits Scaling to 2
Post by: Gunther on January 06, 2013, 11:01:50 PM
Hi Alex,

here is my result:


Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
102 cycles for 1
95 cycles for 2 (dec ecx)
130 cycles for 3 (lea edi)
100 cycles for 4 (sbb edx)
129 cycles for 1
100 cycles for 2 (dec ecx)
130 cycles for 3 (lea edi)
129 cycles for 4 (sbb edx)
128 cycles for 1
95 cycles for 2 (dec ecx)
100 cycles for 3 (lea edi)
128 cycles for 4 (sbb edx)

--- ok ---


Gunther
Title: Re: Bits Scaling to 2
Post by: FORTRANS on January 07, 2013, 12:36:21 AM

pre-P4 (SSE1)
129     cycles for 1
126     cycles for 2 (dec ecx)
137     cycles for 3 (lea edi)
128     cycles for 4 (sbb edx)
123     cycles for 1
124     cycles for 2 (dec ecx)
134     cycles for 3 (lea edi)
125     cycles for 4 (sbb edx)
122     cycles for 1
124     cycles for 2 (dec ecx)
134     cycles for 3 (lea edi)
125     cycles for 4 (sbb edx)

--- ok ---
Title: Re: Bits Scaling to 2
Post by: six_L on January 07, 2013, 12:45:21 AM
Quote
Intel(R) Core(TM) i3 CPU       M 370  @ 2.40GHz (SSE4)
76   cycles for 1
77   cycles for 2 (dec ecx)
75   cycles for 3 (lea edi)
75   cycles for 4 (sbb edx)
74   cycles for 1
76   cycles for 2 (dec ecx)
76   cycles for 3 (lea edi)
76   cycles for 4 (sbb edx)
74   cycles for 1
73   cycles for 2 (dec ecx)
75   cycles for 3 (lea edi)
75   cycles for 4 (sbb edx)

--- ok ---

Title: Re: Bits Scaling to 2
Post by: dedndave on January 07, 2013, 01:26:02 AM
sinsi - my understanding is that a word gets "doubled" into a dword
some kind of exponentiation, actually

i can't think of when or why, but i have had need to do something like this in the past - lol
i think it can be done much faster, though   :P
a 256-byte LUT would be some improvement
but, i think the "Sexy Stir-Fry" technique could be employed
similar to how this bit-reversing algo works
;------------------------------------------------------------

        OPTION  PROLOGUE:None
        OPTION  EPILOGUE:None

BitReverse PROC dwVal:DWORD

;Dword Bit-Swap Algorithm
;Sexy Stir-Fry Technique
;
;The original concept was developed by BitRake and Nexo.
;This version was updated by Drizz and Lingo.

        pop     ecx
        mov     edx,0F0F0F0Fh
        pop     eax
        and     edx,eax
        and     eax,0F0F0F0F0h
        shl     edx,4
        shr     eax,4
        or      eax,edx
        mov     edx,33333333h
        and     edx,eax
        and     eax,0CCCCCCCCh
        shr     eax,2
        lea     eax,[edx*4+eax]
        mov     edx,55555555h
        and     edx,eax
        and     eax,0AAAAAAAAh
        shr     eax,1
        lea     eax,[edx*2+eax]
        bswap   eax
        jmp     ecx

BitReverse ENDP

        OPTION  PROLOGUE:PrologueDef
        OPTION  EPILOGUE:EpilogueDef

;------------------------------------------------------------
Title: Re: Bits Scaling to 2
Post by: Antariy on January 07, 2013, 03:22:41 AM
Hi, Dave :biggrin:

Well, the bit-swapping algo is based on shifts to shuffle the bits, but it cannot be applied here. The only "similar" part is that both algos use shifting instructions (SHR/SHL/LEA) to get the job done, but here we literally need to scale the bit map, not reverse it. When you reverse bits, you work with a fixed pattern (map) of bits and "just" shift and combine it; in scaling you have to duplicate the bit map, but at a higher "resolution" - probably that's the most suitable description. It's like resizing a 1bpp bitmap :biggrin:

Hi, sinsi, yes, any 16-bit (or smaller) number, which will be scaled (up to 32 bits for a 16-bit number).
Title: Re: Bits Scaling to 2
Post by: dedndave on January 07, 2013, 03:49:06 AM
well - certainly, an LUT would be much faster
4 look-ups into a 256-byte table should be about 20-25 clock cycles or something   :P

i still think the sexy stir-fry would work, though - lol
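One way a mask-and-shift approach in the spirit of the bit-reversing routine might look, sketched in C rather than MASM. This is the widely used bit-spreading sequence (the one normally used for interleaving bits), not code from this thread, and `scale_bits2_masks` is a hypothetical name. It first moves the 16 input bits to the even positions, then ORs in a copy shifted left by one.

```c
#include <stdint.h>

/* Sketch under assumptions: doubles each of the low 16 bits of x
   using only shifts, ORs and constant masks (no loop, no table). */
static uint32_t scale_bits2_masks(uint32_t x)
{
    x &= 0x0000FFFFu;                     /* keep the 16 input bits          */
    x = (x | (x << 8)) & 0x00FF00FFu;     /* split into two groups of 8 bits */
    x = (x | (x << 4)) & 0x0F0F0F0Fu;     /* four groups of 4 bits           */
    x = (x | (x << 2)) & 0x33333333u;     /* eight groups of 2 bits          */
    x = (x | (x << 1)) & 0x55555555u;     /* every bit now at an even place  */
    return x | (x << 1);                  /* copy each bit to its neighbour  */
}
```

Whether this beats the loop or the table versions on a given CPU would have to be measured; it is shown only to make the idea concrete.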
Title: Re: Bits Scaling to 2
Post by: jj2007 on January 07, 2013, 04:12:43 AM
Hi Alex,
Just out of curiosity: Is there a real life app behind?
And what is the game's rule? 8 bit becomes 16 bit, 16->32?

1010 -> 11001100
0101 -> 00110011
10101010 -> 1100110011001100

As Dave already wrote, a 256-word LUT would be ideal...
Title: Re: Bits Scaling to 2
Post by: dedndave on January 07, 2013, 04:18:27 AM
yah - that's probably the easiest way to make it fast
you don't have to carry the table in initialized data, either
make a little routine to build the table in uninitialized data at start-up   :t
Title: Re: Bits Scaling to 2
Post by: Gunther on January 07, 2013, 04:27:08 AM
Hi Dave,

Quote from: dedndave on January 07, 2013, 04:18:27 AM
you don't have to carry the table in initialized data, either
make a little routine to build the table in uninitialized data at start-up   :t

good idea. Such a thing is called initial calculation (computation).  :t

Gunther
Title: Re: Bits Scaling to 2
Post by: dedndave on January 07, 2013, 04:29:37 AM
we just call it an "init routine"   :biggrin:

Jochen did something like that for his float-to-ascii thingy

"thingy" = highly technical term
if i tell you about it, i have to kill ya
Title: Re: Bits Scaling to 2
Post by: Gunther on January 07, 2013, 04:33:06 AM
Hi Dave,

"init" or "initial", there's not much difference at all.

Quote from: dedndave on January 07, 2013, 04:29:37 AM
Jochen did something like that for his float-to-ascii thingy

"thingy" = highly technical term
if i tell you about it, i have to kill ya

I know, I know.  :lol:

Gunther
Title: Re: Bits Scaling to 2
Post by: dedndave on January 07, 2013, 04:45:54 AM
doh !

it wouldn't be a 256-byte table - lol
it would be 4 accesses on a 16-byte table - put it in initialized data   :t

you could go the next step up and make 2 accesses on a 256-word table
that one, you could make in uninitialized data and use an "initial calculation (computation)" routine   :biggrin:
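The 16-entry variant with 4 accesses could be sketched in C as follows. The table contents follow directly from the doubling rule; the names `nibble_lut` and `scale_bits2_nibbles` are invented for this sketch.

```c
#include <stdint.h>

/* Each 4-bit value maps to its 8-bit doubled form (e.g. 1010 -> 11001100). */
static const uint8_t nibble_lut[16] = {
    0x00, 0x03, 0x0C, 0x0F, 0x30, 0x33, 0x3C, 0x3F,
    0xC0, 0xC3, 0xCC, 0xCF, 0xF0, 0xF3, 0xFC, 0xFF
};

/* Four table accesses, one per nibble of the 16-bit input. */
static uint32_t scale_bits2_nibbles(uint16_t v)
{
    return  (uint32_t)nibble_lut[v & 0xF]
         | ((uint32_t)nibble_lut[(v >> 4) & 0xF] << 8)
         | ((uint32_t)nibble_lut[(v >> 8) & 0xF] << 16)
         | ((uint32_t)nibble_lut[(v >> 12) & 0xF] << 24);
}
```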
Title: Re: Bits Scaling to 2
Post by: jj2007 on January 07, 2013, 05:02:46 AM
Here it is:

LUT:
dw 0000000000000000b, 0000000000000011b, 0000000000001100b, 0000000000001111b
dw 0000000000110000b, 0000000000110011b, 0000000000111100b, 0000000000111111b
...
dw 1111111111000000b, 1111111111000011b, 1111111111001100b, 1111111111001111b
dw 1111111111110000b, 1111111111110011b, 1111111111111100b, 1111111111111111b


BuildLUT:
  mov edi, offset LUT
  xor ecx, ecx
  inc ch
  or edx, -1
  .Repeat
        inc edx
        xor eax, eax
        push 8
        .Repeat
                shl eax, 2
                rol dl, 1
                .if Carry?
                        or eax, 3        ; set 11
                .endif
                dec dword ptr [esp]
        .Until Zero?
        stosw
        pop eax
        dec ecx
  .Until Zero?
  retn
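A C model of what such a table builder computes, as a rough cross-check of the table values. It walks the input bits from the top down, as the ROL-based loop does, but it is a sketch and not a translation of the MASM above; `lut256` and `build_lut256` are hypothetical names.

```c
#include <stdint.h>

static uint16_t lut256[256];   /* one doubled word per byte value */

static void build_lut256(void)
{
    for (unsigned n = 0; n < 256; ++n) {
        uint16_t wide = 0;
        for (int bit = 7; bit >= 0; --bit) {   /* most significant bit first */
            wide = (uint16_t)(wide << 2);      /* make room for the next pair */
            if (n & (1u << bit))
                wide |= 3;                     /* set 11 */
        }
        lut256[n] = wide;
    }
}
```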
Title: Re: Bits Scaling to 2
Post by: Gunther on January 07, 2013, 05:50:15 AM
Good job, Jochen.  :t Thank you. I think that Alex will enjoy that.

Gunther
Title: Re: Bits Scaling to 2
Post by: dedndave on January 07, 2013, 06:17:34 AM
i see what you mean, Alex
there is probably some mathematical way to do it
a good problem for the guys at The Euler Project   :P

waBdLut dw 256 dup(?)

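InitBdLut PROC

        mov     edx,255                 ;start with the highest index
        push    ebx
        push    edi
        mov     ebx,edx                 ;EBX = current index, 255 down to 1
        mov     edi,offset waBdLut+508  ;the dword store below puts the result word at +510

IBLut0: mov     ecx,8                   ;8 input bits per entry

IBLut1: ror     edx,1                   ;CF = bit 0 (a copy also lands in bit 31)
        rcr     eax,1                   ;shift that bit into EAX from the top
        shl     edx,1                   ;CF = the same bit again, taken back out of bit 31
        rcr     eax,1                   ;shift it in a second time
        shr     edx,1                   ;move on to the next input bit
        dec     ecx
        jnz     IBLut1

        mov     [edi],eax               ;high word of EAX = doubled bits for this index
        sub     edi,2                   ;the low word is overwritten on the next pass
        dec     ebx
        mov     edx,ebx                 ;MOV leaves the flags from DEC intact
        jnz     IBLut0

        mov     [edi+2],bl              ;entry 0: clear the leftover low byte (BL = 0)
        pop     edi
        pop     ebx
        ret

InitBdLut ENDP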


0000 0003 000C 000F FFF0 FFF3 FFFC FFFF
48 bytes of code
30047 30030 30011 29985 29998

Title: Re: Bits Scaling to 2
Post by: Antariy on January 07, 2013, 09:28:38 AM
Thanks for all the tests! :biggrin:

Quote from: jj2007 on January 07, 2013, 04:12:43 AM
Hi Alex,
Just out of curiosity: Is there a real life app behind?
And what is the game's rule? 8 bit becomes 16 bit, 16->32?

1010 -> 11001100
0101 -> 00110011
10101010 -> 1100110011001100

As Dave already wrote, a 256-word LUT would be ideal...

Hi Jochen, yes, there is more code behind it. This proc is just a helper proc to build a LUT for use by another algo :biggrin: So speed isn't a concern here (though it is interesting), but small size is preferable. As you can see, I used the same algo in every test and just moved a couple of instructions to see how it performs. The test showed that SBB doesn't slow the code down much, that DEC behaves differently on different CPU generations and models, as usual, and that the timings are pretty stable - probably the method used (a separate code section for every tested piece) suited its task here.

Yes, the rule is that every 1 bit becomes 2 bits.

Practically, this algo is intended to build a WORD-indexed LUT from a BYTE-indexed LUT, to avoid any index computations (WORD->BYTE) at run time in the algo which will use that WORD index.

Something like this:

XMM0=0012 0034 0056 0078 009A 00BC 00DE 00F0
XMM1=0012 0000 0056 0000 0000 00BC 00DE 0000

PCMPEQW XMM0,XMM1

; XMM0 =  FFFF 0000 FFFF 0000 0000 FFFF FFFF 0000

PMOVMSKB EDX,XMM0

; EDX = 11 00 11 00 00 11 11 00

; [EDX+offset LUT] - some data corresponding to a pattern of the WORDs layout in the XMMs.
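A scalar C model of why the mask comes out with doubled bits. It imitates the lane-wise compare and the byte-wise mask extraction without using intrinsics, so it is an illustration only; `word_equality_mask` is a hypothetical name, and lane 0 is taken to be the lowest word of the register.

```c
#include <stdint.h>

/* Models PCMPEQW followed by PMOVMSKB on eight 16-bit lanes. */
static unsigned word_equality_mask(const uint16_t a[8], const uint16_t b[8])
{
    unsigned mask = 0;
    for (int i = 0; i < 8; ++i) {
        /* PCMPEQW: an equal lane becomes FFFFh, an unequal lane 0000h */
        uint16_t cmp = (a[i] == b[i]) ? 0xFFFFu : 0x0000u;
        /* PMOVMSKB: one mask bit per byte, taken from the byte's top bit */
        mask |= (unsigned)((cmp >> 7) & 1u) << (2 * i);
        mask |= (unsigned)((cmp >> 15) & 1u) << (2 * i + 1);
    }
    return mask;
}
```

Because both bytes of a lane always agree after a word compare, every lane contributes either 00 or 11 to the mask, which is exactly the doubled-bit pattern.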


As you see, here we have doubled (word comparison) resolution, and instead of "resizing" it to one bit per word and accessing a 256-byte LUT (8-bit index), it is faster to use a LUT with a 16-bit index, which needs to be built from the 256-byte LUT with the help of the topic algo (to convert the byte index to a word index while filling the 16-bit indexed table).
Something like:

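; EBX is a byte index
xor ebx,ebx
@@:
invoke ScaleBits,ebx,8          ; EAX = EBX with every bit doubled
movzx edx,byte ptr [ebx+Lut256]
mov [eax+Lut64k],dl
inc ebx
cmp ebx,256                     ; LOOP would need ECX, which the call may not preserve
jb @B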
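The same table expansion, sketched in C. The array and function names are placeholders, and the doubling helper is a plain loop rather than the topic algo.

```c
#include <stdint.h>
#include <string.h>

static uint8_t lut64k[65536];   /* WORD-indexed copy, mostly dummy bytes */

/* Hypothetical stand-in for the bit-doubling proc, 8 bits in, 16 out. */
static uint16_t double_bits8(uint8_t v)
{
    uint16_t wide = 0;
    for (unsigned i = 0; i < 8; ++i)
        if (v & (1u << i))
            wide |= (uint16_t)(3u << (2 * i));
    return wide;
}

/* Copy each byte-indexed entry to the slot addressed by its doubled index. */
static void expand_lut(const uint8_t lut256[256])
{
    memset(lut64k, 0, sizeof lut64k);
    for (unsigned i = 0; i < 256; ++i)
        lut64k[double_bits8((uint8_t)i)] = lut256[i];
}
```

After this runs, a mask such as the one PMOVMSKB produces above can index `lut64k` directly, with no conversion back to an 8-bit index.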



Hi Dave, I understand your point about speeding it up with a LUT, but in this case it is better to use the mathematical way; otherwise one algo is used to build the LUT of another algo that builds the table of yet another algo :biggrin:
Title: Re: Bits Scaling to 2
Post by: dedndave on January 07, 2013, 10:05:56 AM
i get 15 clock cycles on my p4 prescott w/htt   :P

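BitDbl  PROC    dwVal:DWORD

        movzx   eax,byte ptr dwVal+1    ;EAX = high byte of the input word
        mov     ecx,offset waBdLut
        dec     eax                     ;index-1, so the dword load puts the entry in the high word
        movzx   edx,byte ptr dwVal      ;EDX = low byte of the input word
        mov     eax,[ecx+2*eax]         ;high word = entry for the high byte
                                        ;(reads 2 bytes before the table when the high byte is 0)
        mov     ax,[ecx+2*edx]          ;low word = entry for the low byte
        ret

BitDbl  ENDP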
Title: Re: Bits Scaling to 2
Post by: dedndave on January 07, 2013, 10:27:58 AM
i get 9 cycles for this one...

        OPTION  PROLOGUE:None
        OPTION  EPILOGUE:None

BitDbl  PROC    dwVal:DWORD

        movzx   eax,byte ptr [esp+5]
        mov     ecx,offset waBdLut
        dec     eax
        movzx   edx,byte ptr [esp+4]
        mov     eax,[ecx+2*eax]
        mov     ax,[ecx+2*edx]
        ret     4

BitDbl  ENDP

        OPTION  PROLOGUE:PrologueDef
        OPTION  EPILOGUE:EpilogueDef
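In C the two-lookup idea reduces to the following sketch. The names are invented here, and the table is filled by a simple loop instead of InitBdLut. The MASM version saves a shift by loading a dword one entry early so that the high-byte entry lands in the upper half of EAX; the C model just shifts.

```c
#include <stdint.h>

static uint16_t bd_lut[256];   /* doubled form of every byte value */

static void init_bd_lut(void)
{
    for (unsigned n = 0; n < 256; ++n) {
        uint16_t wide = 0;
        for (unsigned i = 0; i < 8; ++i)
            if (n & (1u << i))
                wide |= (uint16_t)(3u << (2 * i));
        bd_lut[n] = wide;
    }
}

/* Two table accesses: one for the high byte, one for the low byte. */
static uint32_t bit_dbl(uint16_t v)
{
    return ((uint32_t)bd_lut[v >> 8] << 16) | bd_lut[v & 0xFF];
}
```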
Title: Re: Bits Scaling to 2
Post by: frktons on January 07, 2013, 11:00:02 AM
Quote
Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz (SSE4)
116     cycles for 1
119     cycles for 2 (dec ecx)
119     cycles for 3 (lea edi)
123     cycles for 4 (sbb edx)
116     cycles for 1
120     cycles for 2 (dec ecx)
119     cycles for 3 (lea edi)
122     cycles for 4 (sbb edx)
116     cycles for 1
120     cycles for 2 (dec ecx)
119     cycles for 3 (lea edi)
133     cycles for 4 (sbb edx)

My tests for Alex routine.
Title: Re: Bits Scaling to 2
Post by: Antariy on January 07, 2013, 11:09:53 AM
Very cool, Dave :t
But still, using an algo that needs its own table inside a helper proc (used once, at process initialization) which itself builds a table for the "main" algo is a "bit too-too (too many things)" :biggrin:
Also, scalability of the scaling (pun) proc would be a good thing here, i.e. if we wanted to make the scaling optional - 2, 4 or 8 times - for your algo that would mean building 3 different tables. But it is really fast (9...11 clocks on my CPU) :icon14:
Title: Re: Bits Scaling to 2
Post by: dedndave on January 07, 2013, 11:50:36 AM
wow - didn't think it was that difficult
if that's too much for you, i suggest you stay with console apps   :lol:

but - you have options
for example, you could just make the table in the .DATA section - it's only 512 bytes in size

another option is to compare the value of the last byte when you call the proc
if it isn't FFh, you call the routine to build the table
that way, it gets built automatically the first time you call it

now - as for 3 or 4 or more bits....
you never said anything about that   ;)
we can't read your mind - lol
Title: Re: Bits Scaling to 2
Post by: Antariy on January 07, 2013, 12:37:11 PM
Quote from: dedndave on January 07, 2013, 11:50:36 AM
another option is to compare the value of the last byte when you call the proc
if it isn't FFh, you call the routine to build the table
that way, it gets built automatically the first time you call it

now - as for 3 or 4 or more bits....
you never said anything about that   ;)
we can't read your mind - lol

Why check the lower byte? It's not a question of how big the table or the index is; the table just has to be a "resized copy" of a smaller 256-byte table. I.e., most of the space in it will be just dummy bytes. Example:

original table:

AA BB

CC DD

the processed table

AA 00 BB 00

CC 00 DD 00



But this is just an example; the real table (not the topic algo's) will contain over 65000 dummy bytes and only 256 bytes of the original, "bit map resized" data.
This is required to avoid any index-fixing computations at runtime.

As for the number of bits, yes, you are right, that was not said. But the algo takes a bit count parameter (it is not a scalable algo yet, though) and it is not unrolled at all, which means speed is not the main concern. The test was merely for seeing how this type of code runs on different CPUs, for asking about new mathematical ways to do this thing, and for chatting with you all (that's no joke) :P
Title: Re: Bits Scaling to 2
Post by: dedndave on January 07, 2013, 12:57:18 PM
what i meant was something like this...

        OPTION  PROLOGUE:None
        OPTION  EPILOGUE:None

BitDbl  PROC    dwVal:DWORD

        cmp byte ptr waBdLut+511,0FFh
        jz      BitDb0

        call    InitBdLut

BitDb0: movzx   eax,byte ptr [esp+5]
        mov     ecx,offset waBdLut
        dec     eax
        movzx   edx,byte ptr [esp+4]
        mov     eax,[ecx+2*eax]
        mov     ax,[ecx+2*edx]
        ret     4

BitDbl  ENDP

        OPTION  PROLOGUE:PrologueDef
        OPTION  EPILOGUE:EpilogueDef


that will add a few clock cycles
better to just define the table in .DATA   :P

but, if you want to handle a variety of bit-widths, you need some other way
Title: Re: Bits Scaling to 2
Post by: sinsi on January 07, 2013, 05:33:14 PM
This gives me 57 cycles for 16 bits

    push ebx
    push esi
    sub eax,eax
    mov edx,[esp+12]
    mov ecx,[esp+16]
    mov ebx,11b
@@: sub esi,esi
    bt edx,0
    cmovc esi,ebx
    shr edx,1
    or eax,esi
    shl ebx,2
    sub ecx,1
    jnz @b
    pop esi
    pop ebx
    ret 8

Title: Re: Bits Scaling to 2
Post by: Antariy on January 09, 2013, 04:36:06 AM
Hi sinsi :t

I have incorporated your code and also added a (probably crude) "universal" version that scales a number of 2...16 bits by a scale factor of 16...2 (respectively).
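What a scale-by-any-factor routine has to compute can be sketched in C like this. It is a guess at the semantics only, since the universal proc itself is in the attachment and not shown in the thread; `scale_bits` and its parameter order are assumptions.

```c
#include <stdint.h>

/* Hypothetical model: every input bit is repeated `factor` times.
   Returns 0 when bits * factor would not fit in 32 bits. */
static uint32_t scale_bits(uint32_t value, unsigned bits, unsigned factor)
{
    if (factor == 0 || factor > 32 || bits > 32 / factor)
        return 0;

    /* run = `factor` consecutive one bits */
    uint32_t run = (factor == 32) ? 0xFFFFFFFFu : ((1u << factor) - 1u);
    uint32_t result = 0;

    for (unsigned i = 0; i < bits; ++i)
        if (value & (1u << i))
            result |= run << (i * factor);
    return result;
}
```

For example, `scale_bits(0x5, 4, 4)` gives 0x0F0F and `scale_bits(0x3, 2, 16)` gives 0xFFFFFFFF.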


Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
165     cycles for 1
198     cycles for 1.2 (dec ecx)
74      cycles for 2 (universal scaling proc) CMOV
108     cycles for 2.1 (universal scaling proc) i386
100     cycles for sinsi's
76      cycles for sinsi's mod
76      cycles for sinsi's mod 2
84      cycles for sinsi's mod 3 (AMD)
98      cycles for sinsi's mod 4 (AMD)
165     cycles for 1
200     cycles for 1.2 (dec ecx)
73      cycles for 2 (universal scaling proc) CMOV
109     cycles for 2.1 (universal scaling proc) i386
99      cycles for sinsi's
76      cycles for sinsi's mod
76      cycles for sinsi's mod 2
82      cycles for sinsi's mod 3 (AMD)
81      cycles for sinsi's mod 4 (AMD)
164     cycles for 1
197     cycles for 1.2 (dec ecx)
74      cycles for 2 (universal scaling proc) CMOV
108     cycles for 2.1 (universal scaling proc) i386
100     cycles for sinsi's
76      cycles for sinsi's mod
76      cycles for sinsi's mod 2
82      cycles for sinsi's mod 3 (AMD)
81      cycles for sinsi's mod 4 (AMD)

--- ok ---
Title: Re: Bits Scaling to 2
Post by: dedndave on January 09, 2013, 08:43:52 AM
prescott w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
167     cycles for 1
197     cycles for 1.2 (dec ecx)
79      cycles for 2 (universal scaling proc) CMOV
108     cycles for 2.1 (universal scaling proc) i386
100     cycles for sinsi's
76      cycles for sinsi's mod
76      cycles for sinsi's mod 2
82      cycles for sinsi's mod 3 (AMD)
81      cycles for sinsi's mod 4 (AMD)
167     cycles for 1
197     cycles for 1.2 (dec ecx)
76      cycles for 2 (universal scaling proc) CMOV
108     cycles for 2.1 (universal scaling proc) i386
99      cycles for sinsi's
76      cycles for sinsi's mod
76      cycles for sinsi's mod 2
82      cycles for sinsi's mod 3 (AMD)
85      cycles for sinsi's mod 4 (AMD)
167     cycles for 1
197     cycles for 1.2 (dec ecx)
75      cycles for 2 (universal scaling proc) CMOV
108     cycles for 2.1 (universal scaling proc) i386
100     cycles for sinsi's
76      cycles for sinsi's mod
76      cycles for sinsi's mod 2
82      cycles for sinsi's mod 3 (AMD)
81      cycles for sinsi's mod 4 (AMD)
Title: Re: Bits Scaling to 2
Post by: Gunther on January 09, 2013, 09:12:57 AM
Hi Alex,

here the results:


Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
129 cycles for 1
100 cycles for 1.2 (dec ecx)
86 cycles for 2 (universal scaling proc) CMOV
102 cycles for 2.1 (universal scaling proc) i386
90 cycles for sinsi's
88 cycles for sinsi's mod
98 cycles for sinsi's mod 2
79 cycles for sinsi's mod 3 (AMD)
78 cycles for sinsi's mod 4 (AMD)
100 cycles for 1
129 cycles for 1.2 (dec ecx)
93 cycles for 2 (universal scaling proc) CMOV
84 cycles for 2.1 (universal scaling proc) i386
101 cycles for sinsi's
98 cycles for sinsi's mod
84 cycles for sinsi's mod 2
77 cycles for sinsi's mod 3 (AMD)
80 cycles for sinsi's mod 4 (AMD)
126 cycles for 1
95 cycles for 1.2 (dec ecx)
87 cycles for 2 (universal scaling proc) CMOV
102 cycles for 2.1 (universal scaling proc) i386
83 cycles for sinsi's
83 cycles for sinsi's mod
98 cycles for sinsi's mod 2
49 cycles for sinsi's mod 3 (AMD)
82 cycles for sinsi's mod 4 (AMD)

--- ok ---


Gunther
Title: Re: Bits Scaling to 2
Post by: Antariy on January 09, 2013, 09:25:59 AM
Hi Dave and Gunther :biggrin:
Thank you for the tests. I was posting in Gunther's thread and did not notice at the time that this topic had been updated.
Title: Re: Bits Scaling to 2
Post by: jj2007 on January 09, 2013, 09:29:15 AM
One more :biggrin:
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
104     cycles for 1
122     cycles for 1.2 (dec ecx)
63      cycles for 2 (universal scaling proc) CMOV
80      cycles for 2.1 (universal scaling proc) i386
87      cycles for sinsi's
89      cycles for sinsi's mod
74      cycles for sinsi's mod 2
100     cycles for sinsi's mod 3 (AMD)
99      cycles for sinsi's mod 4 (AMD)
Title: Re: Bits Scaling to 2
Post by: sinsi on January 09, 2013, 08:01:27 PM
Funny how the (AMD) ones are slowest on the AMD  :lol:

AMD Phenom(tm) II X6 1100T Processor (SSE3)
87      cycles for 1
78      cycles for 1.2 (dec ecx)
43      cycles for 2 (universal scaling proc) CMOV
46      cycles for 2.1 (universal scaling proc) i386
57      cycles for sinsi's
57      cycles for sinsi's mod
56      cycles for sinsi's mod 2
57      cycles for sinsi's mod 3 (AMD)
56      cycles for sinsi's mod 4 (AMD)

Intel(R) Core(TM) i3 CPU       M 380  @ 2.53GHz (SSE4)
75      cycles for 1
76      cycles for 1.2 (dec ecx)
43      cycles for 2 (universal scaling proc) CMOV
48      cycles for 2.1 (universal scaling proc) i386
65      cycles for sinsi's
55      cycles for sinsi's mod
54      cycles for sinsi's mod 2
48      cycles for sinsi's mod 3 (AMD)
48      cycles for sinsi's mod 4 (AMD)

Intel(R) Core(TM) i7-2700K CPU @ 3.50GHz (SSE4)
63      cycles for 1
59      cycles for 1.2 (dec ecx)
47      cycles for 2 (universal scaling proc) CMOV
51      cycles for 2.1 (universal scaling proc) i386
45      cycles for sinsi's
43      cycles for sinsi's mod
42      cycles for sinsi's mod 2
36      cycles for sinsi's mod 3 (AMD)
33      cycles for sinsi's mod 4 (AMD)

Title: Re: Bits Scaling to 2
Post by: FORTRANS on January 10, 2013, 01:59:54 AM

pre-P4 (SSE1)
126     cycles for 1
126     cycles for 1.2 (dec ecx)
87      cycles for 2 (universal scaling proc) CMOV
88      cycles for 2.1 (universal scaling proc) i386
102     cycles for sinsi's
97      cycles for sinsi's mod
102     cycles for sinsi's mod 2
97      cycles for sinsi's mod 3 (AMD)
91      cycles for sinsi's mod 4 (AMD)
122     cycles for 1
124     cycles for 1.2 (dec ecx)
84      cycles for 2 (universal scaling proc) CMOV
85      cycles for 2.1 (universal scaling proc) i386
96      cycles for sinsi's
94      cycles for sinsi's mod
99      cycles for sinsi's mod 2
94      cycles for sinsi's mod 3 (AMD)
88      cycles for sinsi's mod 4 (AMD)
122     cycles for 1
124     cycles for 1.2 (dec ecx)
84      cycles for 2 (universal scaling proc) CMOV
85      cycles for 2.1 (universal scaling proc) i386
96      cycles for sinsi's
94      cycles for sinsi's mod
99      cycles for sinsi's mod 2
95      cycles for sinsi's mod 3 (AMD)
88      cycles for sinsi's mod 4 (AMD)

--- ok ---

Mobile Intel(R) Celeron(R) processor     600MHz (SSE2)
106   cycles for 1
110   cycles for 1.2 (dec ecx)
71   cycles for 2 (universal scaling proc) CMOV
68   cycles for 2.1 (universal scaling proc) i386
82   cycles for sinsi's
79   cycles for sinsi's mod
86   cycles for sinsi's mod 2
75   cycles for sinsi's mod 3 (AMD)
76   cycles for sinsi's mod 4 (AMD)
106   cycles for 1
110   cycles for 1.2 (dec ecx)
72   cycles for 2 (universal scaling proc) CMOV
67   cycles for 2.1 (universal scaling proc) i386
82   cycles for sinsi's
79   cycles for sinsi's mod
86   cycles for sinsi's mod 2
75   cycles for sinsi's mod 3 (AMD)
77   cycles for sinsi's mod 4 (AMD)
106   cycles for 1
110   cycles for 1.2 (dec ecx)
71   cycles for 2 (universal scaling proc) CMOV
68   cycles for 2.1 (universal scaling proc) i386
83   cycles for sinsi's
79   cycles for sinsi's mod
86   cycles for sinsi's mod 2
76   cycles for sinsi's mod 3 (AMD)
76   cycles for sinsi's mod 4 (AMD)

--- ok ---
Title: Re: Bits Scaling to 2
Post by: dedndave on January 10, 2013, 03:48:51 AM
i had an idea, this morning   :biggrin:
the results aren't as fast, which is to be expected
but the function is very flexible...
Quote
;Flexible Bit Stretcher - DednDave - 1, 2013
;
;  Creates variable-multiples of bits from an input word.
;The repeat count for each input bit is derived as follows:
; (1) high-order count bit from high word of wParam
; (2) low-order count bits from lParam
;These 3 bits allow for individual repeat counts of 0 to 7 for each input bit.

these are the results on my prescott if i set it up to double all the bits...
00001h: 496 498 498 497 497
000FFh: 593 594 593 590 613
0AAAAh: 596 598 599 596 599
0FF00h: 593 595 594 592 593
0FFFFh: 640 639 641 640 641
Title: Re: Bits Scaling to 2
Post by: Antariy on January 10, 2013, 01:16:24 PM
Thanks to all for your tests! :biggrin:

Hi Dave :t
Your test:

00001h: 538 540 523 524 530
000FFh: 617 618 618 623 624
0AAAAh: 633 636 647 630 631
0FF00h: 639 624 638 641 645
0FFFFh: 679 674 678 675 674
Press any key to continue ...
Title: Re: Bits Scaling to 2
Post by: dedndave on January 10, 2013, 01:29:32 PM
hi Alex
what do you think of the idea, though   :P

i was thinking of ways it might be used
among other things, it could be used to create complex state machines, i guess
it could be used to create variable bit-masks
Title: Re: Bits Scaling to 2
Post by: Antariy on January 10, 2013, 01:39:50 PM
I have not read the code yet, Dave :biggrin:
From what you said it sounds good.