News:

Masm32 SDK description, downloads and other helpful links
Message to All Guests

Main Menu

Bits Scaling to 2

Started by Antariy, January 06, 2013, 12:25:19 PM

Previous topic - Next topic

dedndave

wow - didn't think it was that difficult
if that's too much for you, i suggest you stay with console apps   :lol:

but - you have options
for example, you could just make the table in the .DATA section - it's ony 512 bytes in size

another option is to compare the value of the last byte when you call the proc
if it isn't FFh, you call the routine to build the table
that way, it gets built automatically the first time you call it

now - as for 3 or 4 or more bits....
you never said anything about that   ;)
we can't read your mind - lol

Antariy

Quote from: dedndave on January 07, 2013, 11:50:36 AM
another option is to compare the value of the last byte when you call the proc
if it isn't FFh, you call the routine to build the table
that way, it gets built automatically the first time you call it

now - as for 3 or 4 or more bits....
you never said anything about that   ;)
we can't read your mind - lol

Why to check lower byte? That's not the question of how big is table or index, the table just should be a "resized copy" of a smaller 256 byte time. I.e., most space in it will be just a dummy bytes. Example:

original table:

AA BB

CC DD

the processed table

AA 00 BB 00

CC 00 DD 00



But this is example, real table (not the topic algo) will contain over 65000 dummy bytes and only 256 bytes of the original, but "bit map resized" data.
It is required to avoid any index-fixing computations at runtime.

As for number of bits, yes, you are right, that was not said, but the algo contains a bit count parameter (not scalable algo yet, though), it is not unrolled at all etc what means the speed is not main concern, and the test was merely for seeing how such type of the code runs on dfferent CPUs, for asking a new mathematical ways to do this thing, and to conversate with you all (that's not joke) :P

dedndave

what i meant was something like this...

        OPTION  PROLOGUE:None
        OPTION  EPILOGUE:None

BitDbl  PROC    dwVal:DWORD

        cmp byte ptr waBdLut+511,0FFh
        jz      BitDb0

        call    InitBdLut

BitDb0: movzx   eax,byte ptr [esp+5]
        mov     ecx,offset waBdLut
        dec     eax
        movzx   edx,byte ptr [esp+4]
        mov     eax,[ecx+2*eax]
        mov     ax,[ecx+2*edx]
        ret     4

BitDbl  ENDP

        OPTION  PROLOGUE:PrologueDef
        OPTION  EPILOGUE:EpilogueDef


that will add a few clock cycles
better to just define the table in .DATA   :P

but, if you want to handle a variety of bit-widths, you need some other way

sinsi

This gives me 57 cycles for 16 bits

    push ebx
    push esi
    sub eax,eax
    mov edx,[esp+12]
    mov ecx,[esp+16]
    mov ebx,11b
@@: sub esi,esi
    bt edx,0
    cmovc esi,ebx
    shr edx,1
    or eax,esi
    shl ebx,2
    sub ecx,1
    jnz @b
    pop esi
    pop ebx
    ret 8


Antariy

Hi sinsi :t

I have incorporated your code as well as added (probably crude) "universal" version to scale number of 2...16 bits by scale factor 16...2 (respectively).


Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
165     cycles for 1
198     cycles for 1.2 (dec ecx)
74      cycles for 2 (universal scaling proc) CMOV
108     cycles for 2.1 (universal scaling proc) i386
100     cycles for sinsi's
76      cycles for sinsi's mod
76      cycles for sinsi's mod 2
84      cycles for sinsi's mod 3 (AMD)
98      cycles for sinsi's mod 4 (AMD)
165     cycles for 1
200     cycles for 1.2 (dec ecx)
73      cycles for 2 (universal scaling proc) CMOV
109     cycles for 2.1 (universal scaling proc) i386
99      cycles for sinsi's
76      cycles for sinsi's mod
76      cycles for sinsi's mod 2
82      cycles for sinsi's mod 3 (AMD)
81      cycles for sinsi's mod 4 (AMD)
164     cycles for 1
197     cycles for 1.2 (dec ecx)
74      cycles for 2 (universal scaling proc) CMOV
108     cycles for 2.1 (universal scaling proc) i386
100     cycles for sinsi's
76      cycles for sinsi's mod
76      cycles for sinsi's mod 2
82      cycles for sinsi's mod 3 (AMD)
81      cycles for sinsi's mod 4 (AMD)

--- ok ---

dedndave

prescott w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
167     cycles for 1
197     cycles for 1.2 (dec ecx)
79      cycles for 2 (universal scaling proc) CMOV
108     cycles for 2.1 (universal scaling proc) i386
100     cycles for sinsi's
76      cycles for sinsi's mod
76      cycles for sinsi's mod 2
82      cycles for sinsi's mod 3 (AMD)
81      cycles for sinsi's mod 4 (AMD)
167     cycles for 1
197     cycles for 1.2 (dec ecx)
76      cycles for 2 (universal scaling proc) CMOV
108     cycles for 2.1 (universal scaling proc) i386
99      cycles for sinsi's
76      cycles for sinsi's mod
76      cycles for sinsi's mod 2
82      cycles for sinsi's mod 3 (AMD)
85      cycles for sinsi's mod 4 (AMD)
167     cycles for 1
197     cycles for 1.2 (dec ecx)
75      cycles for 2 (universal scaling proc) CMOV
108     cycles for 2.1 (universal scaling proc) i386
100     cycles for sinsi's
76      cycles for sinsi's mod
76      cycles for sinsi's mod 2
82      cycles for sinsi's mod 3 (AMD)
81      cycles for sinsi's mod 4 (AMD)

Gunther

Hi Alex,

here the results:


Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
129 cycles for 1
100 cycles for 1.2 (dec ecx)
86 cycles for 2 (universal scaling proc) CMOV
102 cycles for 2.1 (universal scaling proc) i386
90 cycles for sinsi's
88 cycles for sinsi's mod
98 cycles for sinsi's mod 2
79 cycles for sinsi's mod 3 (AMD)
78 cycles for sinsi's mod 4 (AMD)
100 cycles for 1
129 cycles for 1.2 (dec ecx)
93 cycles for 2 (universal scaling proc) CMOV
84 cycles for 2.1 (universal scaling proc) i386
101 cycles for sinsi's
98 cycles for sinsi's mod
84 cycles for sinsi's mod 2
77 cycles for sinsi's mod 3 (AMD)
80 cycles for sinsi's mod 4 (AMD)
126 cycles for 1
95 cycles for 1.2 (dec ecx)
87 cycles for 2 (universal scaling proc) CMOV
102 cycles for 2.1 (universal scaling proc) i386
83 cycles for sinsi's
83 cycles for sinsi's mod
98 cycles for sinsi's mod 2
49 cycles for sinsi's mod 3 (AMD)
82 cycles for sinsi's mod 4 (AMD)

--- ok ---


Gunther
You have to know the facts before you can distort them.

Antariy

Hi Dave and Gunther :biggrin:
Thank you for the tests, I was posting in Gunther's thread and at that moment not noticed this topic updated.

jj2007

One more :biggrin:
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
104     cycles for 1
122     cycles for 1.2 (dec ecx)
63      cycles for 2 (universal scaling proc) CMOV
80      cycles for 2.1 (universal scaling proc) i386
87      cycles for sinsi's
89      cycles for sinsi's mod
74      cycles for sinsi's mod 2
100     cycles for sinsi's mod 3 (AMD)
99      cycles for sinsi's mod 4 (AMD)

sinsi

Funny how the (AMD) ones are slowest on the AMD  :lol:

AMD Phenom(tm) II X6 1100T Processor (SSE3)
87      cycles for 1
78      cycles for 1.2 (dec ecx)
43      cycles for 2 (universal scaling proc) CMOV
46      cycles for 2.1 (universal scaling proc) i386
57      cycles for sinsi's
57      cycles for sinsi's mod
56      cycles for sinsi's mod 2
57      cycles for sinsi's mod 3 (AMD)
56      cycles for sinsi's mod 4 (AMD)

Intel(R) Core(TM) i3 CPU       M 380  @ 2.53GHz (SSE4)
75      cycles for 1
76      cycles for 1.2 (dec ecx)
43      cycles for 2 (universal scaling proc) CMOV
48      cycles for 2.1 (universal scaling proc) i386
65      cycles for sinsi's
55      cycles for sinsi's mod
54      cycles for sinsi's mod 2
48      cycles for sinsi's mod 3 (AMD)
48      cycles for sinsi's mod 4 (AMD)

Intel(R) Core(TM) i7-2700K CPU @ 3.50GHz (SSE4)
63      cycles for 1
59      cycles for 1.2 (dec ecx)
47      cycles for 2 (universal scaling proc) CMOV
51      cycles for 2.1 (universal scaling proc) i386
45      cycles for sinsi's
43      cycles for sinsi's mod
42      cycles for sinsi's mod 2
36      cycles for sinsi's mod 3 (AMD)
33      cycles for sinsi's mod 4 (AMD)


FORTRANS


pre-P4 (SSE1)
126     cycles for 1
126     cycles for 1.2 (dec ecx)
87      cycles for 2 (universal scaling proc) CMOV
88      cycles for 2.1 (universal scaling proc) i386
102     cycles for sinsi's
97      cycles for sinsi's mod
102     cycles for sinsi's mod 2
97      cycles for sinsi's mod 3 (AMD)
91      cycles for sinsi's mod 4 (AMD)
122     cycles for 1
124     cycles for 1.2 (dec ecx)
84      cycles for 2 (universal scaling proc) CMOV
85      cycles for 2.1 (universal scaling proc) i386
96      cycles for sinsi's
94      cycles for sinsi's mod
99      cycles for sinsi's mod 2
94      cycles for sinsi's mod 3 (AMD)
88      cycles for sinsi's mod 4 (AMD)
122     cycles for 1
124     cycles for 1.2 (dec ecx)
84      cycles for 2 (universal scaling proc) CMOV
85      cycles for 2.1 (universal scaling proc) i386
96      cycles for sinsi's
94      cycles for sinsi's mod
99      cycles for sinsi's mod 2
95      cycles for sinsi's mod 3 (AMD)
88      cycles for sinsi's mod 4 (AMD)

--- ok ---

Mobile Intel(R) Celeron(R) processor     600MHz (SSE2)
106   cycles for 1
110   cycles for 1.2 (dec ecx)
71   cycles for 2 (universal scaling proc) CMOV
68   cycles for 2.1 (universal scaling proc) i386
82   cycles for sinsi's
79   cycles for sinsi's mod
86   cycles for sinsi's mod 2
75   cycles for sinsi's mod 3 (AMD)
76   cycles for sinsi's mod 4 (AMD)
106   cycles for 1
110   cycles for 1.2 (dec ecx)
72   cycles for 2 (universal scaling proc) CMOV
67   cycles for 2.1 (universal scaling proc) i386
82   cycles for sinsi's
79   cycles for sinsi's mod
86   cycles for sinsi's mod 2
75   cycles for sinsi's mod 3 (AMD)
77   cycles for sinsi's mod 4 (AMD)
106   cycles for 1
110   cycles for 1.2 (dec ecx)
71   cycles for 2 (universal scaling proc) CMOV
68   cycles for 2.1 (universal scaling proc) i386
83   cycles for sinsi's
79   cycles for sinsi's mod
86   cycles for sinsi's mod 2
76   cycles for sinsi's mod 3 (AMD)
76   cycles for sinsi's mod 4 (AMD)

--- ok ---

dedndave

i had an idea, this morning   :biggrin:
the results aren't as fast, which is to be expected
but the function is very flexible...
Quote;Flexible Bit Stretcher - DednDave - 1, 2013
;
;  Creates variable-multiples of bits from an input word.
;The repeat count for each input bit is derived as follows:
; (1) high-order count bit from high word of wParam
; (2) low-order count bits from lParam
;These 3 bits allow for individual repeat counts of 0 to 7 for each input bit.

these are the results on my prescott if i set it up to double all the bits...
00001h: 496 498 498 497 497
000FFh: 593 594 593 590 613
0AAAAh: 596 598 599 596 599
0FF00h: 593 595 594 592 593
0FFFFh: 640 639 641 640 641

Antariy

Thanks to all for your tests! :biggrin:

Hi Dave :t
Your test:

00001h: 538 540 523 524 530
000FFh: 617 618 618 623 624
0AAAAh: 633 636 647 630 631
0FF00h: 639 624 638 641 645
0FFFFh: 679 674 678 675 674
Press any key to continue ...

dedndave

hi Alex
what do you think of the idea, though   :P

i was thinking of ways it might be used
among other things, it could be used to create complex state machines, i guess
it could be used to create variable bit-masks

Antariy

I have not read the code yet, Dave :biggrin:
From what you said it sounds good.