The MASM Forum
General => The Laboratory => Topic started by: Antariy on January 06, 2013, 12:25:19 PM

A test of several variations of a proc that scales bits to double "width". For instance, if we have the binary number 1010Y, then after processing it becomes 11001100Y.
It is interesting to see how it performs on different machines. The test processes the full width, 16 bits to 32 bits, i.e. 16 iterations through the loop.
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
171 cycles for 1
205 cycles for 2 (dec ecx)
173 cycles for 3 (lea edi)
174 cycles for 4 (sbb edx)
169 cycles for 1
202 cycles for 2 (dec ecx)
175 cycles for 3 (lea edi)
176 cycles for 4 (sbb edx)
170 cycles for 1
203 cycles for 2 (dec ecx)
174 cycles for 3 (lea edi)
175 cycles for 4 (sbb edx)
 ok 
Also, please note how the testbed is implemented; this is a kind of suggestion. Each tested code piece is put in a separate section of the code, i.e. it starts at a new page boundary (similar to align 1000h), and this should help to eliminate the influence of code layout and placement.
The buffer allocation and the walk through all the bytes of the buffer at the start of the test are commented out; they are only needed in tests which require data processing.
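For readers who want the transformation itself spelled out, here is a minimal reference sketch in C (C rather than MASM only so that it can be compiled and checked anywhere). It is not the timed proc from the attachment; the name `scale_bits_x2` and its parameters are made up for illustration.

```c
#include <stdint.h>

/* Double the "width" of the low nbits bits of value:
   every input bit becomes two adjacent output bits.
   nbits is expected to be 0..16 so the result fits in 32 bits. */
uint32_t scale_bits_x2(uint32_t value, unsigned nbits)
{
    uint32_t result = 0;
    for (unsigned i = 0; i < nbits; i++) {
        if (value & (1u << i))
            result |= 3u << (2 * i);   /* set the bit pair "11" */
    }
    return result;
}
```

With 16 input bits the loop runs 16 times, which corresponds to the 16 iterations mentioned above.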

prescott w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
165 cycles for 1
197 cycles for 2 (dec ecx)
170 cycles for 3 (lea edi)
168 cycles for 4 (sbb edx)
163 cycles for 1
191 cycles for 2 (dec ecx)
168 cycles for 3 (lea edi)
168 cycles for 4 (sbb edx)
166 cycles for 1
199 cycles for 2 (dec ecx)
167 cycles for 3 (lea edi)
169 cycles for 4 (sbb edx)

Hi, Dave :t
It is interesting that the second variation of the code differs only in the counter decrement instruction (ADD ECX,-1 / DEC ECX), and overall it is slower by ~2 cycles per iteration.

AMD Phenom(tm) II X6 1100T Processor (SSE3)
88 cycles for 1
77 cycles for 2 (dec ecx)
87 cycles for 3 (lea edi)
89 cycles for 4 (sbb edx)
86 cycles for 1
76 cycles for 2 (dec ecx)
86 cycles for 3 (lea edi)
89 cycles for 4 (sbb edx)
87 cycles for 1
76 cycles for 2 (dec ecx)
92 cycles for 3 (lea edi)
88 cycles for 4 (sbb edx)

Hi, sinsi :t
But AMD favors the things that Intel dislikes, as usual :biggrin:

search the old forum for the words "stir fry" :biggrin:
there was a routine in there for reversing the bits in a register
the method used may be applied, here
very cool algo

here you go, Alex
if you can understand how that algorithm works, i believe you can apply it to what you are doing :t
http://www.masmforum.com/board/index.php?topic=12722.msg98224#msg98224

No, that's not reversal of the bits, just making them "twice as wide" (i.e. 0101 becomes 00110011, 1101 becomes 11110011, etc).
But it is interesting to see that algo, thank you, Dave :t

i understand that
still, i think the algo could be modified to "double" the bits

What number of bits are we looking at? Maximum of 16 I guess but any number? Like 3 bits or 11?

Hi Alex,
here is my result:
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
102 cycles for 1
95 cycles for 2 (dec ecx)
130 cycles for 3 (lea edi)
100 cycles for 4 (sbb edx)
129 cycles for 1
100 cycles for 2 (dec ecx)
130 cycles for 3 (lea edi)
129 cycles for 4 (sbb edx)
128 cycles for 1
95 cycles for 2 (dec ecx)
100 cycles for 3 (lea edi)
128 cycles for 4 (sbb edx)
 ok 
Gunther

pre-P4 (SSE1)
129 cycles for 1
126 cycles for 2 (dec ecx)
137 cycles for 3 (lea edi)
128 cycles for 4 (sbb edx)
123 cycles for 1
124 cycles for 2 (dec ecx)
134 cycles for 3 (lea edi)
125 cycles for 4 (sbb edx)
122 cycles for 1
124 cycles for 2 (dec ecx)
134 cycles for 3 (lea edi)
125 cycles for 4 (sbb edx)
 ok 

Intel(R) Core(TM) i3 CPU M 370 @ 2.40GHz (SSE4)
76 cycles for 1
77 cycles for 2 (dec ecx)
75 cycles for 3 (lea edi)
75 cycles for 4 (sbb edx)
74 cycles for 1
76 cycles for 2 (dec ecx)
76 cycles for 3 (lea edi)
76 cycles for 4 (sbb edx)
74 cycles for 1
73 cycles for 2 (dec ecx)
75 cycles for 3 (lea edi)
75 cycles for 4 (sbb edx)
 ok 

sinsi - my understanding is that a word gets "doubled" into a dword
some kind of exponentiation, actually
i can't think of when or why, but i have had need to do something like this in the past - lol
i think it can be done much faster, though :P
a 256-byte LUT would be some improvement
but, i think the "Sexy Stir-Fry" technique could be employed
similar to how this bit-reversing algo works
;
OPTION PROLOGUE:None
OPTION EPILOGUE:None
BitReverse PROC dwVal:DWORD
;Dword BitSwap Algorithm
;Sexy Stir-Fry Technique
;
;The original concept was developed by BitRake and Nexo.
;This version was updated by Drizz and Lingo.
pop ecx
mov edx,0F0F0F0Fh
pop eax
and edx,eax
and eax,0F0F0F0F0h
shl edx,4
shr eax,4
or eax,edx
mov edx,33333333h
and edx,eax
and eax,0CCCCCCCCh
shr eax,2
lea eax,[edx*4+eax]
mov edx,55555555h
and edx,eax
and eax,0AAAAAAAAh
shr eax,1
lea eax,[edx*2+eax]
bswap eax
jmp ecx
BitReverse ENDP
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
;
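As a hedged aside: one way the mask-and-shift idea might carry over to doubling is the well-known "bit spreading" step used for bit interleaving. This is not code from the thread, and the function name is hypothetical. The sketch first spreads the 16 input bits onto the even bit positions, then copies each one into its odd neighbour.

```c
#include <stdint.h>

/* Double every bit of a 16-bit value using masks and shifts only.
   Steps 1-4 move input bit i to output bit 2*i; the final OR fills
   bit 2*i+1 with a copy. */
uint32_t double_bits_masks(uint16_t v)
{
    uint32_t x = v;
    x = (x | (x << 8)) & 0x00FF00FFu;   /* split into two bytes    */
    x = (x | (x << 4)) & 0x0F0F0F0Fu;   /* split into four nibbles */
    x = (x | (x << 2)) & 0x33333333u;   /* split into bit pairs    */
    x = (x | (x << 1)) & 0x55555555u;   /* one bit per pair        */
    return x | (x << 1);                /* duplicate each bit      */
}
```

It is loop-free and table-free, but fixed at 16 input bits and a factor of 2, so it does not give the variable bit count that the tested proc takes.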

Hi, Dave :biggrin:
Well, the bit-swapping algo is based on shifts that shuffle the bits, but it cannot be applied here. The only "similar" part is that both algos use shifting instructions (SHR/SHL/LEA) to get the job done. Here we literally need to scale the bit map, not reverse it: when you reverse bits you work with a fixed pattern (map) of bits and "just" shift and combine it, while in scaling you have to duplicate the bit map at a higher "resolution" - probably that's the most suitable description. Like resizing a 1bpp bitmap :biggrin:
Hi, sinsi, yes, any 16 bit (or less) number, which will be scaled (up to 32 bits for 16 bit number).

well - certainly, an LUT would be much faster
4 lookups into a 256-byte table should be about 20-25 clock cycles or something :P
i still think the sexy stir-fry would work, though - lol

Hi Alex,
Just out of curiosity: Is there a real life app behind?
And what is the game's rule? 8 bit becomes 16 bit, 16->32?
1010 -> 11001100
0101 -> 00110011
10101010 -> 1100110011001100
As Dave already wrote, a 256-word LUT would be ideal...

yah - that's probably the easiest way to make it fast
you don't have to carry the table in initialized data, either
make a little routine to build the table in uninitialized data at startup :t

Hi Dave,
good idea. Such a thing is called initial calculation (computation). :t
Gunther

we just call in an "init routine" :biggrin:
Jochen did something like that for his float-to-ascii thingy
"thingy" = highly technical term
if i tell you about it, i have to kill ya

Hi Dave,
"init" or "initial", there's not much difference at all.
I know, I know. :lol:
Gunther

doh !
it wouldn't be a 256-byte table - lol
it would be 4 accesses on a 16-byte table - put it in initialized data :t
you could go the next step up and make 2 accesses on a 256-word table
that one, you could make in uninitialized data and use an "initial calculation (computation)" routine :biggrin:
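A rough sketch of the "4 accesses on a 16-byte table" variant, in C for easy checking. The table and function names are invented for the example; each entry is simply the doubled form of one nibble.

```c
#include <stdint.h>

/* Doubled form of every 4-bit value: 16 bytes of initialized data. */
static const uint8_t kNibbleDoubled[16] = {
    0x00, 0x03, 0x0C, 0x0F, 0x30, 0x33, 0x3C, 0x3F,
    0xC0, 0xC3, 0xCC, 0xCF, 0xF0, 0xF3, 0xFC, 0xFF
};

/* Four lookups, one per nibble of the input word. */
uint32_t double_bits_nibble_lut(uint16_t v)
{
    return  (uint32_t)kNibbleDoubled[v & 0xF]
         | ((uint32_t)kNibbleDoubled[(v >> 4) & 0xF] << 8)
         | ((uint32_t)kNibbleDoubled[(v >> 8) & 0xF] << 16)
         | ((uint32_t)kNibbleDoubled[(v >> 12) & 0xF] << 24);
}
```

The table is small enough to type by hand, which is the trade-off against the 256-word version that needs a build routine or 512 bytes of data.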

Here it is:
LUT:
dw 0000000000000000b, 0000000000000011b, 0000000000001100b, 0000000000001111b
dw 0000000000110000b, 0000000000110011b, 0000000000111100b, 0000000000111111b
...
dw 1111111111000000b, 1111111111000011b, 1111111111001100b, 1111111111001111b
dw 1111111111110000b, 1111111111110011b, 1111111111111100b, 1111111111111111b
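BuildLUT:
mov edi, offset LUT ; destination: 256 words
xor ecx, ecx
inc ch ; ecx = 256 table entries
or edx, -1 ; edx = -1, so the first inc yields index 0
.Repeat
inc edx ; dl = current byte index
xor eax, eax
push 8 ; inner bit counter lives on the stack
.Repeat
shl eax, 2 ; make room for the next bit pair
rol dl, 1 ; next source bit (MSB first) into Carry
.if Carry?
or eax, 3 ; set 11
.endif
dec dword ptr [esp]
.Until Zero?
stosw ; store the doubled word
pop eax ; discard the spent counter
dec ecx
.Until Zero?
retn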
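For comparison, roughly the same table builder in C, plus the "2 accesses on a 256-word table" lookup that goes with it. This is only a sketch; `g_lut`, `build_lut` and `double_bits_lut` are names made up here, not anything from the posted code.

```c
#include <stdint.h>

static uint16_t g_lut[256];   /* uninitialized data, filled once at startup */

/* Fill g_lut[i] with the bit-doubled form of the byte i. */
void build_lut(void)
{
    for (unsigned i = 0; i < 256; i++) {
        uint16_t w = 0;
        for (int bit = 7; bit >= 0; bit--) {   /* MSB first, like the ROL loop */
            w = (uint16_t)(w << 2);            /* make room for a bit pair */
            if (i & (1u << bit))
                w |= 3;                        /* set 11 */
        }
        g_lut[i] = w;
    }
}

/* Two lookups: high byte gives the high word, low byte the low word. */
uint32_t double_bits_lut(uint16_t v)
{
    return ((uint32_t)g_lut[v >> 8] << 16) | g_lut[v & 0xFF];
}
```

`build_lut()` has to run before the first call to `double_bits_lut()`.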

Good job, Jochen. :t Thank you. I think that Alex will enjoy that.
Gunther

i see what you mean, Alex
there is probably some mathematical way to do it
a good problem for the guys at The Euler Project :P
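waBdLut dw 256 dup(?)
InitBdLut PROC
mov edx,255 ;start with the last index
push ebx
push edi
mov ebx,edx ;ebx = index counter (255 down to 1)
mov edi,offset waBdLut+508 ;a dword store here puts its high word in entry 255
IBLut0: mov ecx,8 ;8 source bits per entry
IBLut1: ror edx,1 ;bit 0 -> CF (and bit 31)
rcr eax,1 ;first copy of the bit into eax
shl edx,1 ;bit 31 -> CF again
rcr eax,1 ;second copy of the bit
shr edx,1 ;advance to the next source bit
dec ecx
jnz IBLut1
mov [edi],eax ;high word of eax is the doubled value
sub edi,2
dec ebx
mov edx,ebx ;mov leaves the flags from dec ebx intact
jnz IBLut0
mov [edi+2],bl ;entry 0: clear the low byte left by the last store
pop edi
pop ebx
ret
InitBdLut ENDP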
0000 0003 000C 000F FFF0 FFF3 FFFC FFFF
48 bytes of code
30047 30030 30011 29985 29998

Thanks for all the tests! :biggrin:
Hi Alex,
Just out of curiosity: Is there a real life app behind?
And what is the game's rule? 8 bit becomes 16 bit, 16->32?
1010 -> 11001100
0101 -> 00110011
10101010 -> 1100110011001100
As Dave already wrote, a 256-word LUT would be ideal...
Hi Jochen, yes, there should be more code behind it. This proc is just a helper proc that builds a LUT for use by another algo :biggrin: So speed isn't a concern here (though it is interesting); small size is preferable. As you can see, I used the same algo in every test and just moved a couple of instructions to see how it performs. The test showed that SBB doesn't slow the code down much, that DEC behaves differently on different CPU generations and models, as usual, and that the timings are pretty stable - probably the method used (a separate code section for every tested piece) suited its task here.
Yes, the rule is that every 1 bit becomes 2 bits.
Practically, this algo is intended to build a WORD-indexed LUT from a BYTE-indexed LUT, to avoid any computations (WORD->BYTE) in real time in the algo which will use that WORD index.
Something like this:
XMM0=0012 0034 0056 0078 009A 00BC 00DE 00F0
XMM1=0012 0000 0056 0000 0000 00BC 00DE 0000
PCMPEQW XMM0,XMM1
; XMM0 = FFFF 0000 FFFF 0000 0000 FFFF FFFF 0000
PMOVMSKB EDX,XMM0
; EDX = 11 00 11 00 00 11 11 00
; [EDX+offset LUT] - some data corresponding to the pattern of the WORDs layout in the XMMs.
As you see, here we have doubled resolution (word comparison). Instead of "resizing" it to one-bit-per-word and accessing a 256-byte LUT (8 bit index), it is faster to use a LUT with a 16 bit index, which needs to be built from the 256-byte LUT with the help of the topic algo (to convert the byte index to a word index while filling the 16-bit-indexed table).
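To make the doubled mask concrete, here is a small portable C emulation of what PCMPEQW followed by PMOVMSKB yields for eight word lanes. It is a sketch, not SSE code: the function name is invented, and it assumes lane 0 is the least significant word, i.e. the rightmost word in the register dumps above.

```c
#include <stdint.h>

/* Emulates PCMPEQW + PMOVMSKB for eight 16-bit lanes.
   An equal word becomes FFFFh, and PMOVMSKB takes the top bit of each
   of its two bytes, so each equal lane contributes two set mask bits. */
uint32_t cmpeq_words_mask(const uint16_t a[8], const uint16_t b[8])
{
    uint32_t mask = 0;
    for (int lane = 0; lane < 8; lane++) {
        if (a[lane] == b[lane])
            mask |= 3u << (2 * lane);
    }
    return mask;
}
```

Feeding it the two XMM values from the example (stored lowest lane first) gives the 11 00 11 00 00 11 11 00 pattern shown for EDX, which is exactly a bit-doubled 8-bit "one bit per word" mask.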
Something like:
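; EBX is a byte index (0..255)
xor ebx,ebx
@@:
invoke ScaleBits,ebx,8 ; EAX = word index with every bit doubled
movzx edx,byte ptr [ebx+Lut256] ; fetch the byte-indexed entry
mov [eax+Lut64k],dl ; store it at the word-indexed slot
inc ebx
cmp ebx,256 ; explicit bound: ECX may not survive the invoke
jb @B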
Hi Dave, I understand your point about speeding it up with a LUT, but in this case it is better to use the mathematical way, because otherwise one algo is used to build the LUT of another algo that builds the table of yet another algo :biggrin:
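The table-filling loop from this post could look roughly like this in C. It is a sketch under assumptions: `lut256` and `lut64k` stand in for Lut256 and Lut64k, `double8` stands in for ScaleBits with a bit count of 8, and unused slots are assumed to be zero.

```c
#include <stdint.h>
#include <string.h>

/* Stand-in for ScaleBits(i, 8): double each of the low 8 bits. */
static uint32_t double8(uint32_t i)
{
    uint32_t r = 0;
    for (unsigned bit = 0; bit < 8; bit++) {
        if (i & (1u << bit))
            r |= 3u << (2 * bit);
    }
    return r;
}

/* Spread a byte-indexed table over a word-indexed one: entry i moves
   to the slot whose index is the bit-doubled form of i. All other
   slots are dummy bytes, set to zero here. */
void build_word_indexed_lut(const uint8_t lut256[256], uint8_t lut64k[65536])
{
    memset(lut64k, 0, 65536);
    for (unsigned i = 0; i < 256; i++)
        lut64k[double8(i)] = lut256[i];
}
```

Only 256 of the 65536 slots are ever written; the rest exist so that the PMOVMSKB result can be used as an index directly.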

i get 15 clock cycles on my p4 prescott w/htt :P
BitDbl PROC dwVal:DWORD
movzx eax,byte ptr dwVal+1
mov ecx,offset waBdLut
dec eax
movzx edx,byte ptr dwVal
mov eax,[ecx+2*eax]
mov ax,[ecx+2*edx]
ret
BitDbl ENDP

i get 9 cycles for this one...
OPTION PROLOGUE:None
OPTION EPILOGUE:None
BitDbl PROC dwVal:DWORD
movzx eax,byte ptr [esp+5]
mov ecx,offset waBdLut
dec eax
movzx edx,byte ptr [esp+4]
mov eax,[ecx+2*eax]
mov ax,[ecx+2*edx]
ret 4
BitDbl ENDP
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz (SSE4)
116 cycles for 1
119 cycles for 2 (dec ecx)
119 cycles for 3 (lea edi)
123 cycles for 4 (sbb edx)
116 cycles for 1
120 cycles for 2 (dec ecx)
119 cycles for 3 (lea edi)
122 cycles for 4 (sbb edx)
116 cycles for 1
120 cycles for 2 (dec ecx)
119 cycles for 3 (lea edi)
133 cycles for 4 (sbb edx)
My tests for Alex's routine.

Very cool, Dave :t
But still, using an algo to build a table, to be used by a helper proc (run once at process initialization) that builds a table, to be used by the "main" algo, is a bit "too-too" (too many things) :biggrin:
Also, scalability of the scaling (pun) proc would be a good thing here, i.e. if we wanted to make the scaling optional - 2, 4 or 8 times - for your algo that means building 3 different tables. But it is really fast (9...11 clocks on my CPU) :icon14:

wow - didn't think it was that difficult
if that's too much for you, i suggest you stay with console apps :lol:
but - you have options
for example, you could just make the table in the .DATA section - it's only 512 bytes in size
another option is to compare the value of the last byte when you call the proc
if it isn't FFh, you call the routine to build the table
that way, it gets built automatically the first time you call it
now - as for 3 or 4 or more bits....
you never said anything about that ;)
we can't read your mind - lol

Why check the lower byte? It's not a question of how big the table or index is; the table should just be a "resized copy" of a smaller 256-byte table, i.e. most of the space in it will be just dummy bytes. Example:
original table:
AA BB
CC DD
the processed table
AA 00 BB 00
CC 00 DD 00
But this is just an example; the real table (not the topic algo) will contain over 65000 dummy bytes and only 256 bytes of the original, "bit map resized" data.
This is required to avoid any index-fixing computations at runtime.
As for the number of bits, yes, you are right, that was not said. But the algo takes a bit count parameter (it is not a scalable algo yet, though) and it is not unrolled at all, which means speed is not the main concern. The test was merely for seeing how this type of code runs on different CPUs, for asking about new mathematical ways to do this thing, and for conversing with you all (that's not a joke) :P

what i meant was something like this...
OPTION PROLOGUE:None
OPTION EPILOGUE:None
BitDbl PROC dwVal:DWORD
cmp byte ptr waBdLut+511,0FFh
jz BitDb0
call InitBdLut
BitDb0: movzx eax,byte ptr [esp+5]
mov ecx,offset waBdLut
dec eax
movzx edx,byte ptr [esp+4]
mov eax,[ecx+2*eax]
mov ax,[ecx+2*edx]
ret 4
BitDbl ENDP
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
that will add a few clock cycles
better to just define the table in .DATA :P
but, if you want to handle a variety of bit-widths, you need some other way
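The build-on-first-call idea, sketched in C. The names are made up, and the builder here uses a simple recurrence rather than the rotate loop, which is a swapped-in technique and not what InitBdLut does. The check relies on the last entry being 0FFFFh only once the table has been built.

```c
#include <stdint.h>

static uint16_t lut[256];   /* zero-filled at load, so lut[255] != 0xFFFF */

/* lut[i] is lut[i >> 1] moved up by one bit pair, plus 11 if bit 0 is set. */
static void init_lut(void)
{
    lut[0] = 0;
    for (unsigned i = 1; i < 256; i++)
        lut[i] = (uint16_t)((lut[i >> 1] << 2) | ((i & 1u) ? 3u : 0u));
}

/* Builds the table automatically on the first call, then does two lookups. */
uint32_t bit_dbl(uint16_t v)
{
    if (lut[255] != 0xFFFF)      /* last entry doubles as the "built" flag */
        init_lut();
    return ((uint32_t)lut[v >> 8] << 16) | lut[v & 0xFF];
}
```

The extra compare costs a little on every call, and the sketch makes no attempt to be thread-safe.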

This gives me 57 cycles for 16 bits
push ebx
push esi
sub eax,eax
mov edx,[esp+12]
mov ecx,[esp+16]
mov ebx,11b
@@: sub esi,esi
bt edx,0
cmovc esi,ebx
shr edx,1
or eax,esi
shl ebx,2
sub ecx,1
jnz @b
pop esi
pop ebx
ret 8
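A C rendering of the same moving-mask loop may help when comparing it with the other variants. It is only an approximation: the function name is invented, and the branch-free select below merely imitates what CMOVC does in the assembly.

```c
#include <stdint.h>

/* Doubles the low nbits bits of value, LSB first, with a "11" pair
   that moves two places left on every pass. */
uint32_t double_bits_moving_mask(uint32_t value, unsigned nbits)
{
    uint32_t result = 0;
    uint32_t pair = 3u;                     /* plays the role of EBX */
    while (nbits--) {
        uint32_t sel = 0u - (value & 1u);   /* all ones if the bit is set */
        result |= pair & sel;               /* branch-free, like CMOVC */
        value >>= 1;
        pair <<= 2;
    }
    return result;
}
```

One difference from the assembly: a bit count of 0 simply returns 0 here.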

Hi sinsi :t
I have incorporated your code and also added a (probably crude) "universal" version that scales a number of 2...16 bits by a scale factor of 16...2 (respectively).
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
165 cycles for 1
198 cycles for 1.2 (dec ecx)
74 cycles for 2 (universal scaling proc) CMOV
108 cycles for 2.1 (universal scaling proc) i386
100 cycles for sinsi's
76 cycles for sinsi's mod
76 cycles for sinsi's mod 2
84 cycles for sinsi's mod 3 (AMD)
98 cycles for sinsi's mod 4 (AMD)
165 cycles for 1
200 cycles for 1.2 (dec ecx)
73 cycles for 2 (universal scaling proc) CMOV
109 cycles for 2.1 (universal scaling proc) i386
99 cycles for sinsi's
76 cycles for sinsi's mod
76 cycles for sinsi's mod 2
82 cycles for sinsi's mod 3 (AMD)
81 cycles for sinsi's mod 4 (AMD)
164 cycles for 1
197 cycles for 1.2 (dec ecx)
74 cycles for 2 (universal scaling proc) CMOV
108 cycles for 2.1 (universal scaling proc) i386
100 cycles for sinsi's
76 cycles for sinsi's mod
76 cycles for sinsi's mod 2
82 cycles for sinsi's mod 3 (AMD)
81 cycles for sinsi's mod 4 (AMD)
 ok 
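For what the "universal" scaling in this post means in practice, here is a hedged C sketch. It is not the proc from the attachment; `scale_bits` and its parameters are hypothetical, and it accepts any bit count and factor whose product fits in 32 bits.

```c
#include <stdint.h>

/* Repeat each of the low nbits bits of value factor times.
   Returns 0 if the parameters are out of range (result would not
   fit in 32 bits, or either parameter is 0). */
uint32_t scale_bits(uint32_t value, unsigned nbits, unsigned factor)
{
    if (nbits == 0 || factor == 0 || nbits > 32 || factor > 32 ||
        nbits * factor > 32)
        return 0;

    /* A run of factor one-bits, e.g. factor 4 -> 1111b. */
    uint32_t run = (factor == 32) ? 0xFFFFFFFFu : ((1u << factor) - 1u);
    uint32_t result = 0;
    for (unsigned i = 0; i < nbits; i++) {
        if (value & (1u << i))
            result |= run << (i * factor);
    }
    return result;
}
```

For example, `scale_bits(0xA5, 8, 4)` gives F0F00F0Fh, and a factor of 2 with 16 bits reproduces the plain doubling from the first post.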

prescott w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
167 cycles for 1
197 cycles for 1.2 (dec ecx)
79 cycles for 2 (universal scaling proc) CMOV
108 cycles for 2.1 (universal scaling proc) i386
100 cycles for sinsi's
76 cycles for sinsi's mod
76 cycles for sinsi's mod 2
82 cycles for sinsi's mod 3 (AMD)
81 cycles for sinsi's mod 4 (AMD)
167 cycles for 1
197 cycles for 1.2 (dec ecx)
76 cycles for 2 (universal scaling proc) CMOV
108 cycles for 2.1 (universal scaling proc) i386
99 cycles for sinsi's
76 cycles for sinsi's mod
76 cycles for sinsi's mod 2
82 cycles for sinsi's mod 3 (AMD)
85 cycles for sinsi's mod 4 (AMD)
167 cycles for 1
197 cycles for 1.2 (dec ecx)
75 cycles for 2 (universal scaling proc) CMOV
108 cycles for 2.1 (universal scaling proc) i386
100 cycles for sinsi's
76 cycles for sinsi's mod
76 cycles for sinsi's mod 2
82 cycles for sinsi's mod 3 (AMD)
81 cycles for sinsi's mod 4 (AMD)

Hi Alex,
here the results:
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
129 cycles for 1
100 cycles for 1.2 (dec ecx)
86 cycles for 2 (universal scaling proc) CMOV
102 cycles for 2.1 (universal scaling proc) i386
90 cycles for sinsi's
88 cycles for sinsi's mod
98 cycles for sinsi's mod 2
79 cycles for sinsi's mod 3 (AMD)
78 cycles for sinsi's mod 4 (AMD)
100 cycles for 1
129 cycles for 1.2 (dec ecx)
93 cycles for 2 (universal scaling proc) CMOV
84 cycles for 2.1 (universal scaling proc) i386
101 cycles for sinsi's
98 cycles for sinsi's mod
84 cycles for sinsi's mod 2
77 cycles for sinsi's mod 3 (AMD)
80 cycles for sinsi's mod 4 (AMD)
126 cycles for 1
95 cycles for 1.2 (dec ecx)
87 cycles for 2 (universal scaling proc) CMOV
102 cycles for 2.1 (universal scaling proc) i386
83 cycles for sinsi's
83 cycles for sinsi's mod
98 cycles for sinsi's mod 2
49 cycles for sinsi's mod 3 (AMD)
82 cycles for sinsi's mod 4 (AMD)
 ok 
Gunther

Hi Dave and Gunther :biggrin:
Thank you for the tests. I was posting in Gunther's thread and did not notice at that moment that this topic had been updated.

One more :biggrin:
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
104 cycles for 1
122 cycles for 1.2 (dec ecx)
63 cycles for 2 (universal scaling proc) CMOV
80 cycles for 2.1 (universal scaling proc) i386
87 cycles for sinsi's
89 cycles for sinsi's mod
74 cycles for sinsi's mod 2
100 cycles for sinsi's mod 3 (AMD)
99 cycles for sinsi's mod 4 (AMD)

Funny how the (AMD) ones are slowest on the AMD :lol:
AMD Phenom(tm) II X6 1100T Processor (SSE3)
87 cycles for 1
78 cycles for 1.2 (dec ecx)
43 cycles for 2 (universal scaling proc) CMOV
46 cycles for 2.1 (universal scaling proc) i386
57 cycles for sinsi's
57 cycles for sinsi's mod
56 cycles for sinsi's mod 2
57 cycles for sinsi's mod 3 (AMD)
56 cycles for sinsi's mod 4 (AMD)
Intel(R) Core(TM) i3 CPU M 380 @ 2.53GHz (SSE4)
75 cycles for 1
76 cycles for 1.2 (dec ecx)
43 cycles for 2 (universal scaling proc) CMOV
48 cycles for 2.1 (universal scaling proc) i386
65 cycles for sinsi's
55 cycles for sinsi's mod
54 cycles for sinsi's mod 2
48 cycles for sinsi's mod 3 (AMD)
48 cycles for sinsi's mod 4 (AMD)
Intel(R) Core(TM) i7-2700K CPU @ 3.50GHz (SSE4)
63 cycles for 1
59 cycles for 1.2 (dec ecx)
47 cycles for 2 (universal scaling proc) CMOV
51 cycles for 2.1 (universal scaling proc) i386
45 cycles for sinsi's
43 cycles for sinsi's mod
42 cycles for sinsi's mod 2
36 cycles for sinsi's mod 3 (AMD)
33 cycles for sinsi's mod 4 (AMD)

pre-P4 (SSE1)
126 cycles for 1
126 cycles for 1.2 (dec ecx)
87 cycles for 2 (universal scaling proc) CMOV
88 cycles for 2.1 (universal scaling proc) i386
102 cycles for sinsi's
97 cycles for sinsi's mod
102 cycles for sinsi's mod 2
97 cycles for sinsi's mod 3 (AMD)
91 cycles for sinsi's mod 4 (AMD)
122 cycles for 1
124 cycles for 1.2 (dec ecx)
84 cycles for 2 (universal scaling proc) CMOV
85 cycles for 2.1 (universal scaling proc) i386
96 cycles for sinsi's
94 cycles for sinsi's mod
99 cycles for sinsi's mod 2
94 cycles for sinsi's mod 3 (AMD)
88 cycles for sinsi's mod 4 (AMD)
122 cycles for 1
124 cycles for 1.2 (dec ecx)
84 cycles for 2 (universal scaling proc) CMOV
85 cycles for 2.1 (universal scaling proc) i386
96 cycles for sinsi's
94 cycles for sinsi's mod
99 cycles for sinsi's mod 2
95 cycles for sinsi's mod 3 (AMD)
88 cycles for sinsi's mod 4 (AMD)
 ok 
Mobile Intel(R) Celeron(R) processor 600MHz (SSE2)
106 cycles for 1
110 cycles for 1.2 (dec ecx)
71 cycles for 2 (universal scaling proc) CMOV
68 cycles for 2.1 (universal scaling proc) i386
82 cycles for sinsi's
79 cycles for sinsi's mod
86 cycles for sinsi's mod 2
75 cycles for sinsi's mod 3 (AMD)
76 cycles for sinsi's mod 4 (AMD)
106 cycles for 1
110 cycles for 1.2 (dec ecx)
72 cycles for 2 (universal scaling proc) CMOV
67 cycles for 2.1 (universal scaling proc) i386
82 cycles for sinsi's
79 cycles for sinsi's mod
86 cycles for sinsi's mod 2
75 cycles for sinsi's mod 3 (AMD)
77 cycles for sinsi's mod 4 (AMD)
106 cycles for 1
110 cycles for 1.2 (dec ecx)
71 cycles for 2 (universal scaling proc) CMOV
68 cycles for 2.1 (universal scaling proc) i386
83 cycles for sinsi's
79 cycles for sinsi's mod
86 cycles for sinsi's mod 2
76 cycles for sinsi's mod 3 (AMD)
76 cycles for sinsi's mod 4 (AMD)
 ok 

i had an idea, this morning :biggrin:
the results aren't as fast, which is to be expected
but the function is very flexible...
;Flexible Bit Stretcher - DednDave - 1, 2013
;
; Creates variable multiples of bits from an input word.
;The repeat count for each input bit is derived as follows:
; (1) high-order count bit from high word of wParam
; (2) low-order count bits from lParam
;These 3 bits allow for individual repeat counts of 0 to 7 for each input bit.
these are the results on my prescott if i set it up to double all the bits...
00001h: 496 498 498 497 497
000FFh: 593 594 593 590 613
0AAAAh: 596 598 599 596 599
0FF00h: 593 595 594 592 593
0FFFFh: 640 639 641 640 641

Thanks to all for your tests! :biggrin:
Hi Dave :t
Your test:
00001h: 538 540 523 524 530
000FFh: 617 618 618 623 624
0AAAAh: 633 636 647 630 631
0FF00h: 639 624 638 641 645
0FFFFh: 679 674 678 675 674
Press any key to continue ...

hi Alex
what do you think of the idea, though :P
i was thinking of ways it might be used
among other things, it could be used to create complex state machines, i guess
it could be used to create variable bitmasks

I have not read the code yet, Dave :biggrin:
From what you said it sounds good.