Topic: BT instruction family

Gunther

  • Member
  • *****
  • Posts: 4198
  • Forgive your enemies, but never forget their names
BT instruction family
« on: April 06, 2014, 02:54:40 AM »
I'm currently wrestling with the Chen-Ho encoding. It isn't as hard or complicated as it sounds, but it involves a lot of bit fiddling. My plan is to use the BT instruction family (BT, BTC, BTR, BTS). Is that a good idea? I've read that those instructions tend to be slow. Does anyone have experience with them?
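For readers who don't have the four instructions in their head, here is a minimal sketch in C (not MASM) that models what BT, BTC, BTR and BTS do to a 32-bit register operand with an immediate or register bit index. The type `bt_result` and the `model_*` names are made up for illustration; only the behaviour (the old bit goes to CF, then the bit is left alone, set, reset or complemented) is taken from the instruction definitions.

```c
#include <stdint.h>

/* Hypothetical helper type: the operand after the instruction plus the
   carry flag, which always receives the OLD value of the selected bit. */
typedef struct {
    uint32_t value;
    unsigned carry;
} bt_result;

/* BT: copy the selected bit to CF, operand unchanged. */
bt_result model_bt(uint32_t v, unsigned bit)
{
    bt_result r;
    r.value = v;
    r.carry = (v >> (bit & 31u)) & 1u;
    return r;
}

/* BTS: like BT, then set the bit. */
bt_result model_bts(uint32_t v, unsigned bit)
{
    bt_result r = model_bt(v, bit);
    r.value = v | (UINT32_C(1) << (bit & 31u));
    return r;
}

/* BTR: like BT, then reset (clear) the bit. */
bt_result model_btr(uint32_t v, unsigned bit)
{
    bt_result r = model_bt(v, bit);
    r.value = v & ~(UINT32_C(1) << (bit & 31u));
    return r;
}

/* BTC: like BT, then complement the bit. */
bt_result model_btc(uint32_t v, unsigned bit)
{
    bt_result r = model_bt(v, bit);
    r.value = v ^ (UINT32_C(1) << (bit & 31u));
    return r;
}
```

In this model `model_btc(0x80, 7)` yields value 0 with carry 1, which is the combined test-and-flip that makes BTC attractive for this kind of bit shuffling.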

Gunther 
You have to know the facts before you can distort them.

hutch--

  • Administrator
  • Member
  • ******
  • Posts: 10583
  • Mnemonic Driven API Grinder
    • The MASM32 SDK
Re: BT instruction family
« Reply #1 on: April 06, 2014, 02:38:53 AM »
Gunther,

I usually try and use AND masking instead of bit instructions. TEST is also useful in some bit contexts.
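A rough sketch of the masking approach, written in C rather than MASM so the logic can be checked in isolation. The function names are hypothetical. One thing masking can do that a single BT cannot is look at several bits at once.

```c
#include <stdint.h>

/* Nonzero if at least one bit of mask is set in v.
   Roughly what "test eax, mask" followed by jnz/setnz decides. */
int any_set(uint32_t v, uint32_t mask)
{
    return (v & mask) != 0;
}

/* Nonzero if every bit of mask is set in v.
   Needs an AND plus a compare against the mask. */
int all_set(uint32_t v, uint32_t mask)
{
    return (v & mask) == mask;
}

/* Clear the masked bits: "and eax, not mask". */
uint32_t clear_bits(uint32_t v, uint32_t mask)
{
    return v & ~mask;
}
```

For example, `any_set(0x0A, 0x05)` is 0 because neither bit 0 nor bit 2 is set in 0x0A.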
hutch at movsd dot com
http://www.masm32.com

qWord

  • Member
  • *****
  • Posts: 1475
  • The base type of a type is the type itself
    • SmplMath macros
Re: BT instruction family
« Reply #2 on: April 06, 2014, 02:54:23 AM »
What about a LUT? 2^12 * 2 bytes = 8 KB is not much.
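To show the size arithmetic, here is a minimal C sketch of a table with 2^12 two-byte entries, indexed by a 12-bit value (three packed BCD digits). The table contents are only a stand-in: this version maps packed BCD to its binary value, because that mapping is easy to verify. A real table would hold the Chen-Ho (or DPD) code for each index instead. All names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define LUT_ENTRIES (1u << 12)        /* 4096 possible 12-bit indices */
#define LUT_INVALID 0xFFFFu           /* marker for non-BCD indices   */

static uint16_t lut[LUT_ENTRIES];     /* 4096 * 2 bytes = 8192 bytes  */

/* Fill the table once. Stand-in mapping: packed BCD -> binary value. */
void lut_init(void)
{
    unsigned i;
    for (i = 0; i < LUT_ENTRIES; ++i) {
        unsigned h = (i >> 8) & 0xFu;   /* hundreds digit */
        unsigned t = (i >> 4) & 0xFu;   /* tens digit     */
        unsigned o = i & 0xFu;          /* ones digit     */
        if (h > 9u || t > 9u || o > 9u)
            lut[i] = LUT_INVALID;
        else
            lut[i] = (uint16_t)(h * 100u + t * 10u + o);
    }
}

/* One masked load replaces the whole bit-shuffling sequence. */
uint16_t lut_lookup(unsigned bcd12)
{
    return lut[bcd12 & (LUT_ENTRIES - 1u)];
}

size_t lut_size_bytes(void)
{
    return sizeof lut;
}
```

After `lut_init()`, `lut_size_bytes()` reports 8192, matching the 8 KB estimate, and each conversion costs a single indexed load.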
MREAL macros - when you need floating point arithmetic while assembling!

jj2007

  • Member
  • *****
  • Posts: 13957
  • Assembly is fun ;-)
    • MasmBasic
Re: BT instruction family
« Reply #3 on: April 06, 2014, 03:12:30 AM »
LUT sounds good. Why don't you go straight for DPD?

Gunther

Re: BT instruction family
« Reply #4 on: April 06, 2014, 04:05:04 AM »
Steve,

Quote from: hutch--
I usually try and use AND masking instead of bit instructions. TEST is also useful in some bit contexts.

the trick is that some bits must be complemented, so I think BTC is not such a bad choice.

qWord,

Quote from: qWord
What about a LUT? 2^12 * 2 bytes = 8 KB is not much.

yes, a LUT can speed things up, I know. On the other hand, encoding to and decoding from BCD needs no multiplications or divisions, only logical instructions. Under 64-bit Windows and Linux we can hold all values in CPU registers. That's a definite advantage.

Jochen,

Quote from: jj2007
LUT sounds good. Why don't you go straight for DPD?

Densely Packed Decimal encoding was invented in 2002 and is a refinement of Chen-Ho. I need such an encoding scheme for a special purpose, and I'm not sure at the moment which one to use.

Gunther   

MichaelW

  • Global Moderator
  • Member
  • *****
  • Posts: 1196
Re: BT instruction family
« Reply #5 on: April 06, 2014, 04:11:29 AM »
I did a test of BT a while back and the timing results were something like 4-5 cycles on a P3 and ~10 cycles on a P4 Northwood. If you need the full functionality, including the modulo 16, 32, or 64 of the bit-offset operand, I can’t see any way to go faster.
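The modulo behaviour applies when the bit base is a register: the bit offset wraps at the operand size. Here is a small C model of just that detail, with the hypothetical name `bt_reg` and the assumption that `width` is always 16, 32 or 64. In C a shift by the operand width or more is undefined, so the wrap has to be written out explicitly.

```c
#include <stdint.h>

/* Model of "bt reg, offset" for a register bit base.
   width is the operand size in bits and is assumed to be 16, 32 or 64.
   The offset is reduced modulo the width before the bit is selected;
   the returned value is what would land in CF. */
unsigned bt_reg(uint64_t value, unsigned offset, unsigned width)
{
    unsigned index = offset % width;   /* the implicit wrap-around */
    return (unsigned)((value >> index) & 1u);
}
```

So in this model an offset of 32 on a 32-bit operand selects bit 0, and an offset of 31 on a 16-bit operand selects bit 15.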
Well Microsoft, here’s another nice mess you’ve gotten us into.

Gunther

Re: BT instruction family
« Reply #6 on: April 06, 2014, 07:07:50 PM »
Michael,

Quote from: MichaelW
I did a test of BT a while back and the timing results were something like 4-5 cycles on a P3 and ~10 cycles on a P4 Northwood. If you need the full functionality, including the modulo 16, 32, or 64 of the bit-offset operand, I can’t see any way to go faster.

thank you for the information. I'll try the BTC instruction and I'll see if it fits my needs.

Gunther

jj2007

Re: BT instruction family
« Reply #7 on: April 06, 2014, 08:17:39 PM »
I thought test was faster than bt, but on my CPU, they perform identically:

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
11161   cycles for 100 * bt
11155   cycles for 100 * test

Gunther

Re: BT instruction family
« Reply #8 on: April 06, 2014, 09:02:05 PM »
Jochen,

thank you for the test bed. Here are the results:
Code: [Select]
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)

7433    cycles for 100 * bt
7453    cycles for 100 * test

6849    cycles for 100 * bt
6830    cycles for 100 * test

6833    cycles for 100 * bt
6822    cycles for 100 * test

15      bytes for bt
16      bytes for test

--- ok ---

The same results here. On the other hand, the test bed isn't realistic enough:
Code: [Select]
        btc        eax, 7
selects bit 7 in EAX, stores the value of that bit in the CF flag, and complements bit 7 in EAX, whereas TEST only computes the bitwise logical AND to set the flags and leaves EAX unchanged. For a fair comparison we have to add the time needed to complement the bit.
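To make the comparison concrete, here is a rough C model of the two ways to test and flip bit 7. It only illustrates the work each path has to do and says nothing about timing. The names `bit7_step`, `via_btc` and `via_test_xor` are hypothetical.

```c
#include <stdint.h>

typedef struct {
    uint32_t eax;       /* register contents after the sequence        */
    unsigned old_bit;   /* old state of bit 7, as reported by the flags */
} bit7_step;

/* btc eax, 7 : CF = old bit 7, then bit 7 is complemented. */
bit7_step via_btc(uint32_t eax)
{
    bit7_step s;
    s.old_bit = (eax >> 7) & 1u;
    s.eax = eax ^ 0x80u;
    return s;
}

/* test eax, 80h : ZF = 0 when bit 7 is set, EAX untouched.
   xor  eax, 80h : the extra instruction that does the complement.
   XOR rewrites the flags, so real code has to consume the TEST
   result (jcc/setcc) before the XOR executes. */
bit7_step via_test_xor(uint32_t eax)
{
    bit7_step s;
    s.old_bit = (eax & 0x80u) != 0;
    s.eax = eax ^ 0x80u;
    return s;
}
```

Both functions return the same result for any input; the difference is that the second path stands for two instructions where BTC needs one.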

Gunther

dedndave

  • Member
  • *****
  • Posts: 8828
  • Still using Abacus 2.0
    • DednDave
Re: BT instruction family
« Reply #9 on: April 06, 2014, 09:32:32 PM »
Prescott w/ HTT
Code: [Select]
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)

27368   cycles for 100 * bt
27322   cycles for 100 * test

27366   cycles for 100 * bt
27316   cycles for 100 * test

27329   cycles for 100 * bt
27284   cycles for 100 * test

Siekmanski

  • Member
  • *****
  • Posts: 2725
Re: BT instruction family
« Reply #10 on: April 07, 2014, 12:54:14 AM »
Code: [Select]
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)

7853    cycles for 100 * bt
7854    cycles for 100 * test

7853    cycles for 100 * bt
7855    cycles for 100 * test

7853    cycles for 100 * bt
7853    cycles for 100 * test

15      bytes for bt
16      bytes for test
Creative coders use backward thinking techniques as a strategy.

jj2007

Re: BT instruction family
« Reply #11 on: April 07, 2014, 01:12:52 AM »
Quote from: Gunther
the test bed isn't realistic enough:
Code: [Select]
        btc        eax, 7
selects bit 7 in EAX, stores the value of that bit in the CF flag, and complements bit 7 in EAX, whereas TEST only computes the bitwise logical AND to set the flags and leaves EAX unchanged. For a fair comparison we have to add the time needed to complement the bit.

Just delete the c in btc and test again.

Gunther

Re: BT instruction family
« Reply #12 on: April 07, 2014, 04:58:22 AM »
Jochen,

Quote from: jj2007
Just delete the c in btc and test again.

that's not the point. I don't think the timing differences are dramatic, but BTC is exactly the instruction I need.

Gunther

FORTRANS

  • Member
  • *****
  • Posts: 1238
Re: BT instruction family
« Reply #13 on: April 07, 2014, 06:30:13 AM »
Hi,

   Ran the test on some older computers.  On the oldest, the
bit test instruction was slower, though not horribly so.

Regards,

Steve N.

pre-P4 (SSE1)

11835   cycles for 100 * bt
11823   cycles for 100 * test

11831   cycles for 100 * bt
11832   cycles for 100 * test

11832   cycles for 100 * bt
11823   cycles for 100 * test

15   bytes for bt
16   bytes for test


--- ok ---
pre-P4
28376   cycles for 100 * bt
11188   cycles for 100 * test

28371   cycles for 100 * bt
11202   cycles for 100 * test

28371   cycles for 100 * bt
11195   cycles for 100 * test

15   bytes for bt
16   bytes for test


--- ok ---
pre-P4
11878   cycles for 100 * bt
11947   cycles for 100 * test

11936   cycles for 100 * bt
11929   cycles for 100 * test

11955   cycles for 100 * bt
11920   cycles for 100 * test

15      bytes for bt
16      bytes for test


--- ok ---

Gunther

Re: BT instruction family
« Reply #14 on: April 07, 2014, 08:20:58 AM »
Hi Steve,

Quote from: FORTRANS
Ran the test on some older computers. On the oldest, the bit test instruction was slower, though not horribly so.

interesting results. The second pre-P4 looks very strange.

Gunther