BT instruction family

Started by Gunther, April 06, 2014, 02:54:40 AM

Gunther

I'm fighting with the Chen-Ho encoding at the moment. It's not as hard or complicated as it sounds, but it involves a lot of bit fiddling. My plan is to use the BT instruction family (BT, BTC, BTR, BTS). Is that a good idea? I've read that those instructions tend to be slow. Does anyone have experience with them?

Gunther 
You have to know the facts before you can distort them.

hutch--

Gunther,

I usually try and use AND masking instead of bit instructions. TEST is also useful in some bit contexts.
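To illustrate the masking approach mentioned above, here is a minimal C sketch. The helper names any_bit_set and all_bits_set are made up for this example. It shows one thing a mask can do that a single BT cannot: check several bit positions in one operation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Roughly what `test eax, mask` followed by `jnz` decides:
   true if at least one of the selected bits is set.
   Like TEST, the value itself is left unchanged. */
static inline bool any_bit_set(uint32_t value, uint32_t mask)
{
    return (value & mask) != 0;
}

/* Roughly `and` with the mask followed by `cmp` against the mask:
   true only if every selected bit is set. */
static inline bool all_bits_set(uint32_t value, uint32_t mask)
{
    return (value & mask) == mask;
}
```

With mask 0x88 (bits 7 and 3), any_bit_set(0x80, 0x88) is true while all_bits_set(0x80, 0x88) is false.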

qWord

What about a LUT? 2^12 * 2 bytes = 8 KB is nothing much.
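The size estimate can be checked with a small C sketch of such a table. Note the assumptions: the table is indexed by a 12-bit packed-BCD value (three digits), and placeholder_encode is a stand-in that simply converts BCD to binary 0..999. It is NOT the Chen-Ho mapping, which would be plugged in at that point.

```c
#include <stdint.h>

#define LUT_ENTRIES (1u << 12)   /* 2^12 possible 12-bit inputs */
#define LUT_INVALID 0xFFFFu      /* marker for non-BCD inputs (assumption) */

/* 4096 entries * 2 bytes = 8192 bytes */
static uint16_t lut[LUT_ENTRIES];

/* Stand-in encoder (hypothetical): packed BCD -> binary 0..999.
   A real table would store the 10-bit Chen-Ho or DPD code instead. */
static uint16_t placeholder_encode(uint16_t bcd)
{
    unsigned hundreds = (bcd >> 8) & 0xFu;
    unsigned tens     = (bcd >> 4) & 0xFu;
    unsigned units    = bcd & 0xFu;

    if (hundreds > 9 || tens > 9 || units > 9)
        return LUT_INVALID;
    return (uint16_t)(hundreds * 100 + tens * 10 + units);
}

/* Fill the table once; afterwards encoding is a single indexed load. */
static void lut_init(void)
{
    for (unsigned i = 0; i < LUT_ENTRIES; ++i)
        lut[i] = placeholder_encode((uint16_t)i);
}
```

After lut_init(), a lookup such as lut[0x123] replaces all of the bit manipulation at run time, at the cost of a memory access.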

jj2007

LUT sounds good. Why don't you go straight for DPD?

Gunther

Steve,

Quote from: hutch-- on April 06, 2014, 02:38:53 AM
I usually try and use AND masking instead of bit instructions. TEST is also useful in some bit contexts.

the trick is that some bits must be complemented. Therefore I think that BTC is not a bad fit.

qWord,

Quote from: qWord on April 06, 2014, 02:54:23 AM
What about a LUT? 2^12 * 2 bytes = 8 KB is nothing much.

yes, a LUT can speed things up, I know. On the other hand, no multiplications or divisions are necessary for encoding and decoding to or from BCD, only logical instructions. Under 64 bit Windows and Linux we can hold all values inside the CPU registers. That's a definite advantage.
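As a small illustration of the "only logical instructions" point, the C sketch below uses nothing but shifts, ANDs and ORs. It computes a 3-bit selector from the top bit of each BCD digit, since both Chen-Ho and DPD distinguish small digits (0-7) from large ones (8-9). The function name and the bit order of the selector are assumptions for this example, and this is only the classification step, not the full encoding.

```c
#include <stdint.h>

/* Input: three packed BCD digits in bits 11..0 (hundreds, tens, units).
   Output: bit 2 = hundreds digit is 8 or 9,
           bit 1 = tens digit is 8 or 9,
           bit 0 = units digit is 8 or 9.
   Only shifts and logical operations are used. */
static inline unsigned large_digit_flags(uint16_t bcd)
{
    unsigned h = (bcd >> 11) & 1u;   /* top bit of the hundreds nibble */
    unsigned t = (bcd >> 7)  & 1u;   /* top bit of the tens nibble     */
    unsigned u = (bcd >> 3)  & 1u;   /* top bit of the units nibble    */

    return (h << 2) | (t << 1) | u;
}
```

For example, 0x189 has a large tens digit and a large units digit, so the selector is binary 011.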

Jochen,

Quote from: jj2007 on April 06, 2014, 03:12:30 AM
LUT sounds good. Why don't you go straight for DPD?

Densely Packed Decimal encoding was invented in 2002 and is a refinement of Chen-Ho. I need such an encoding scheme for a special purpose and I'm not sure at the moment which one to use.

Gunther   

MichaelW

I did a test of BT a while back and the timing results were something like 4-5 cycles on a P3 and ~10 cycles on a P4 Northwood. If you need the full functionality, including the modulo 16, 32, or 64 of the bit-offset operand, I can't see any way to go faster.
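The modulo behaviour mentioned above can be sketched in C as follows. This models BT with a register destination only, where the bit offset wraps at the operand size. The function names bt16, bt32 and bt64 are invented for the example, and the return value stands in for the CF flag.

```c
#include <stdint.h>

/* BT on a 16-bit register: the offset is taken modulo 16. */
static inline unsigned bt16(uint16_t value, uint32_t offset)
{
    return ((unsigned)value >> (offset & 15u)) & 1u;
}

/* BT on a 32-bit register: the offset is taken modulo 32. */
static inline unsigned bt32(uint32_t value, uint32_t offset)
{
    return (value >> (offset & 31u)) & 1u;
}

/* BT on a 64-bit register: the offset is taken modulo 64. */
static inline unsigned bt64(uint64_t value, uint32_t offset)
{
    return (unsigned)((value >> (offset & 63u)) & 1u);
}
```

So bt32(0x80, 39) reads bit 7, because 39 modulo 32 is 7. If the wrap-around is not needed, a plain shift or mask with a constant does the same job.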

Gunther

Michael,

Quote from: MichaelW on April 06, 2014, 04:11:29 AM
I did a test of BT a while back and the timing results were something like 4-5 cycles on a P3 and ~10 cycles on a P4 Northwood. If you need the full functionality, including the modulo 16, 32, or 64 of the bit-offset operand, I can't see any way to go faster.

thank you for the information. I'll try the BTC instruction and I'll see if it fits my needs.

Gunther

jj2007

I thought test was faster than bt, but on my CPU, they perform identically:

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
11161   cycles for 100 * bt
11155   cycles for 100 * test

Gunther

Jochen,

thank you for the test bed. Here are the results:

Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)

7433    cycles for 100 * bt
7453    cycles for 100 * test

6849    cycles for 100 * bt
6830    cycles for 100 * test

6833    cycles for 100 * bt
6822    cycles for 100 * test

15      bytes for bt
16      bytes for test

--- ok ---


The same results here. On the other hand, the test bed isn't realistic enough:

        btc        eax, 7

selects bit 7 in EAX, stores the value of that bit in the CF flag, and complements bit 7 in EAX, whereas TEST only computes the bitwise logical AND and sets the flags. For a fair comparison we have to add the time needed to complement the appropriate bit.
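The difference can be made concrete with a short C model of BTC on a 32-bit register. The name btc32 is made up for this sketch and the return value stands in for CF. It shows the two halves: the TEST-like read of the bit and the additional complement that a TEST-only benchmark leaves out.

```c
#include <stdint.h>

/* Model of `btc reg, bit` for a 32-bit register operand.
   Returns the old value of the selected bit (what CF would hold)
   and flips that bit in *reg. */
static inline unsigned btc32(uint32_t *reg, unsigned bit)
{
    uint32_t mask  = UINT32_C(1) << (bit & 31u);
    unsigned carry = (*reg & mask) != 0;   /* the part TEST also does  */

    *reg ^= mask;                          /* the extra complement step */
    return carry;
}
```

Starting with 0x80 in the register, the first call with bit 7 returns 1 and leaves 0; a second call returns 0 and restores 0x80.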

Gunther

dedndave

Prescott w/ HTT
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)

27368   cycles for 100 * bt
27322   cycles for 100 * test

27366   cycles for 100 * bt
27316   cycles for 100 * test

27329   cycles for 100 * bt
27284   cycles for 100 * test

Siekmanski

Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)

7853    cycles for 100 * bt
7854    cycles for 100 * test

7853    cycles for 100 * bt
7855    cycles for 100 * test

7853    cycles for 100 * bt
7853    cycles for 100 * test

15      bytes for bt
16      bytes for test

jj2007

Quote from: Gunther on April 06, 2014, 09:02:05 PM
the test bed isn't realistic enough:

        btc        eax, 7

selects bit 7 in EAX, stores the value of that bit in the CF flag, and complements bit 7 in EAX, whereas TEST only computes the bitwise logical AND and sets the flags. For a fair comparison we have to add the time needed to complement the appropriate bit.

Just delete the c in btc and test again.

Gunther

Jochen,

Quote from: jj2007 on April 07, 2014, 01:12:52 AM
Just delete the c in btc and test again.

that's not the point. I think the time differences are not that dramatic. But BTC is exactly the instruction I need.

Gunther

FORTRANS

Hi,

   Ran the test on some older computers.  On the oldest, the
bit test instruction was slower.  Though not horribly so.

Regards,

Steve N.

pre-P4 (SSE1)

11835   cycles for 100 * bt
11823   cycles for 100 * test

11831   cycles for 100 * bt
11832   cycles for 100 * test

11832   cycles for 100 * bt
11823   cycles for 100 * test

15   bytes for bt
16   bytes for test


--- ok ---
pre-P4
28376   cycles for 100 * bt
11188   cycles for 100 * test

28371   cycles for 100 * bt
11202   cycles for 100 * test

28371   cycles for 100 * bt
11195   cycles for 100 * test

15   bytes for bt
16   bytes for test


--- ok ---
pre-P4
11878   cycles for 100 * bt
11947   cycles for 100 * test

11936   cycles for 100 * bt
11929   cycles for 100 * test

11955   cycles for 100 * bt
11920   cycles for 100 * test

15      bytes for bt
16      bytes for test


--- ok ---

Gunther

Hi Steve,

Quote from: FORTRANS on April 07, 2014, 06:30:13 AM
   Ran the test on some older computers.  On the oldest, the
bit test instruction was slower.  Though not horribly so.

interesting results. The second pre-P4 looks very strange.

Gunther