The MASM Forum

General => The Campus => Topic started by: Gunther on April 06, 2014, 02:54:40 AM

Title: BT instruction family
Post by: Gunther on April 06, 2014, 02:54:40 AM
I'm fighting at the moment with the Chen-Ho encoding. It's not so hard and complicated as it sounds, but it's a lot of bit fumbling. My plan is to use the BT instruction family (BT, BTC, BTR, BTS). Is that a good idea? I've read that those instructions tend to be slow. Has anyone experiences?

Gunther 
Title: Re: BT instruction family
Post by: hutch-- on April 06, 2014, 02:38:53 AM
Gunther,

I usually try and use AND masking instead of bit instructions. TEST is also useful in some bit contexts.
Title: Re: BT instruction family
Post by: qWord on April 06, 2014, 02:54:23 AM
What about a LUT? 212*2Byte = 8KB is nothing much.
Title: Re: BT instruction family
Post by: jj2007 on April 06, 2014, 03:12:30 AM
LUT sounds good. Why don' you go straight for DPD (http://speleotrove.com/decimal/DPDecimal.html)?
Title: Re: BT instruction family
Post by: Gunther on April 06, 2014, 04:05:04 AM
Steve,

Quote from: hutch-- on April 06, 2014, 02:38:53 AM
I usually try and use AND masking instead of bit instructions. TEST is also useful in some bit contexts.

the trick is, some bits must be complemented. Therefore I think that BTC is not so bad.

qWord,

Quote from: qWord on April 06, 2014, 02:54:23 AM
What about a LUT? 212*2Byte = 8KB is nothing much.

yes a LUT can speed up things, I know. On the other hand there are no multiplications or divisions necessary for encoding and decoding to or from BCD. Only logical instructions. Under 64 bit Windows and Linux we can hold all values inside the CPU. That's a definitive advantage.

Jochen,

Quote from: jj2007 on April 06, 2014, 03:12:30 AM
LUT sounds good. Why don' you go straight for DPD (http://speleotrove.com/decimal/DPDecimal.html)?

Densely Packed Decimal encoding was invented in 2002 and is a refinement to Chen-Ho. I need such an encoding scheme for a special purpose and I'm not sure at the moment which one to use.

Gunther   
Title: Re: BT instruction family
Post by: MichaelW on April 06, 2014, 04:11:29 AM
I did a test of BT a while back and the timing results were something like 4-5 cycles on a P3 and ~10 cycles on a P4 Northwood. If you need the full functionality, including the modulo 16, 32, or 64 of the bit-offset operand, I can't see any way to go faster.
Title: Re: BT instruction family
Post by: Gunther on April 06, 2014, 07:07:50 PM
Michael,

Quote from: MichaelW on April 06, 2014, 04:11:29 AM
I did a test of BT a while back and the timing results were something like 4-5 cycles on a P3 and ~10 cycles on a P4 Northwood. If you need the full functionality, including the modulo 16, 32, or 64 of the bit-offset operand, I can't see any way to go faster.

thank you for the information. I'll try the BTC instruction and I'll see if it fits my needs.

Gunther
Title: Re: BT instruction family
Post by: jj2007 on April 06, 2014, 08:17:39 PM
I thought test was faster than bt, but on my CPU, they perform identically:

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
11161   cycles for 100 * bt
11155   cycles for 100 * test
Title: Re: BT instruction family
Post by: Gunther on April 06, 2014, 09:02:05 PM
Jochen,

thank you for the test bed. Here are the results:

Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)

7433    cycles for 100 * bt
7453    cycles for 100 * test

6849    cycles for 100 * bt
6830    cycles for 100 * test

6833    cycles for 100 * bt
6822    cycles for 100 * test

15      bytes for bt
16      bytes for test

--- ok ---


The same results here. On the other hand, the test bed isn't realistic enough:

        btc        eax, 7

selects bit 7 in EAX, stores the value of that bit in the CF flag, and complements bit 7 in EAX. While TEST computes the bit-wise logical AND. We have to add the time to complement the appropriate bit.

Gunther
Title: Re: BT instruction family
Post by: dedndave on April 06, 2014, 09:32:32 PM
prescott w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)

27368   cycles for 100 * bt
27322   cycles for 100 * test

27366   cycles for 100 * bt
27316   cycles for 100 * test

27329   cycles for 100 * bt
27284   cycles for 100 * test
Title: Re: BT instruction family
Post by: Siekmanski on April 07, 2014, 12:54:14 AM
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)

7853    cycles for 100 * bt
7854    cycles for 100 * test

7853    cycles for 100 * bt
7855    cycles for 100 * test

7853    cycles for 100 * bt
7853    cycles for 100 * test

15      bytes for bt
16      bytes for test
Title: Re: BT instruction family
Post by: jj2007 on April 07, 2014, 01:12:52 AM
Quote from: Gunther on April 06, 2014, 09:02:05 PMthe test bed isn't realistic enough:

        btc        eax, 7

selects bit 7 in EAX, stores the value of that bit in the CF flag, and complements bit 7 in EAX. While TEST computes the bit-wise logical AND. We have to add the time to complement the appropriate bit.

Just delete the c in btc and test again.
Title: Re: BT instruction family
Post by: Gunther on April 07, 2014, 04:58:22 AM
Jochen,

Quote from: jj2007 on April 07, 2014, 01:12:52 AM
Just delete the c in btc and test again.

that's not the point. I think that the time differences are not so dramatic. But BTC is exactly the instruction what I would need.

Gunther
Title: Re: BT instruction family
Post by: FORTRANS on April 07, 2014, 06:30:13 AM
Hi,

   Ran the test on some older computers.  On the oldest, the
bit test instruction was slower.  Though not horribly so.

Regards,

Steve N.

pre-P4 (SSE1)

11835   cycles for 100 * bt
11823   cycles for 100 * test

11831   cycles for 100 * bt
11832   cycles for 100 * test

11832   cycles for 100 * bt
11823   cycles for 100 * test

15   bytes for bt
16   bytes for test


--- ok ---pre-P4
28376   cycles for 100 * bt
11188   cycles for 100 * test

28371   cycles for 100 * bt
11202   cycles for 100 * test

28371   cycles for 100 * bt
11195   cycles for 100 * test

15   bytes for bt
16   bytes for test


--- ok ---
pre-P4
11878   cycles for 100 * bt
11947   cycles for 100 * test

11936   cycles for 100 * bt
11929   cycles for 100 * test

11955   cycles for 100 * bt
11920   cycles for 100 * test

15      bytes for bt
16      bytes for test


--- ok ---
Title: Re: BT instruction family
Post by: Gunther on April 07, 2014, 08:20:58 AM
Hi Steve,

Quote from: FORTRANS on April 07, 2014, 06:30:13 AM
   Ran the test on some older computers.  On the oldest, the
bit test instruction was slower.  Though not horribly so.

interesting results. The second pre-P4 looks very strange.

Gunther
Title: Re: BT instruction family
Post by: hutch-- on April 07, 2014, 11:15:18 AM
Gunther,

If you need to use the BT family of instructions, if you carefully select the instructions around it you can probably get a few following instruction executions to fit into any stall hole it may leave.
Title: Re: BT instruction family
Post by: Gunther on April 07, 2014, 06:41:12 PM
Steve,

Quote from: hutch-- on April 07, 2014, 11:15:18 AM
If you need to use the BT family of instructions, if you carefully select the instructions around it you can probably get a few following instruction executions to fit into any stall hole it may leave.

that's my plan. I've finished the BCD to Chen-Ho compression. It's a bit lengthy, but fast. There are no data Cache or RAM accesses necessary. All values are inside of registers. The code contains no loops or jumps, only AND, OR, BTC and MOV on a register to register basis. So the prefetch queue stands healthy over the entire procedure. I think that's a good starting point.

I've now to write the Chen-Ho to BCD expansion. That should work in a similar way. If it works, I'll open a new thread with a test program. I need all that for a special encoding question in Geneva.

Gunther
Title: Re: BT instruction family
Post by: FORTRANS on April 07, 2014, 11:56:52 PM
Hi Gunther,

Quote from: Gunther on April 07, 2014, 08:20:58 AM
interesting results. The second pre-P4 looks very strange.

   Oh?  Well it is a Pentium MMX in a laptop.  Looking at some old
opcode timings, the bit test instructions are a bit slow compared
to logical or arithmetic operations, but not too bad.  The bit scan
instructions do look rather bad for 386/483/Pentium.  Maybe that
is where the bit test operations got a reputation for being slow.

   The other two processors were a P-III and a P-II.

   By the way, using the Chen-Ho encoding (or DPD) seems
somewhat complex when compared to BCD or binary.  (At first
glance to me anyway.)  When you finish coding your project I
would be interested in a timing comparison of Chen-Ho, BCD,
and binary processing if that is possible.  Idle curiosity, so don't
do it if it does not interest you.

Regards,

Steve N.
Title: Re: BT instruction family
Post by: Gunther on April 08, 2014, 01:10:13 AM
Hi Steve,

Quote from: FORTRANS on April 07, 2014, 11:56:52 PM
   By the way, using the Chen-Ho encoding (or DPD) seems
somewhat complex when compared to BCD or binary.  (At first
glance to me anyway.)  When you finish coding your project I
would be interested in a timing comparison of Chen-Ho, BCD,
and binary processing if that is possible.  Idle curiosity, so don't
do it if it does not interest you.

you're right. The operating times are: Chen-Ho > BCD > Binary. We can test it, but that'll be the result. Chen-Ho or DPD are only of interest to save bandwith for temporary storage.

Let me give you an example. For some image compression techniques, it's a good idea to save some transformation parameters, like scale and offset. Those parameters are stored in 12 bit each, makes 24 bit together. Chen-Ho or DPD can store 12 bit into 10 bit without information loss. That saves 400 bit for 100 transformations. Large images have often several thousands of such transformations.

Gunther 
Title: Re: BT instruction family
Post by: FORTRANS on April 08, 2014, 08:07:47 AM
Hi Gunther,

   Well Chen-Ho will save bytes over BCD, but not over binary.  So
I guess my question would be, why do you need to use decimal digits
in your code.  Or maybe I have missed something.

Regards,

Steve N.
Title: Re: BT instruction family
Post by: Gunther on April 08, 2014, 08:43:51 AM
Steve,

Quote from: FORTRANS on April 08, 2014, 08:07:47 AM
   Well Chen-Ho will save bytes over BCD, but not over binary.  So
I guess my question would be, why do you need to use decimal digits
in your code.  Or maybe I have missed something.

take care. Chen-Ho and DPD have the same efficiency like the binary represantation. For example: The decimal value 999 has the binary representation 1111100111. The Chen-Ho representation is 111 111 1001 and as DPD value it would be 001 111 1111.

I wrote in reply #18 about scale and offset. The offset isn't the point, because these are integer values. But the scale is often a fractional decimal value and that's a real problem because of rounding errors. With BCD it can be avoided, but that's not so efficient. That's the reason for DPD or Chen-Ho.

For a more technical interpretation: The scale is the contrast and the offset is the brightness of an image region.

Gunther
Title: Re: BT instruction family
Post by: FORTRANS on April 08, 2014, 11:27:05 PM
Hi Gunther,

   Okay.  Thank you for the explanation.  I will have to consider
it to see if I see the practicality of it.

QuoteChen-Ho and DPD have the same efficiency like the binary represantation.

   An interesting observation on its own.  Though 999 versus
1024 if decimal digits are not required.  I suppose I should
investigate the subject further.

Regards,

Steve N.
Title: Re: BT instruction family
Post by: dedndave on April 09, 2014, 01:18:31 AM
let's say you have 5 values

10000 2710h, 14 bits
10001 2711h, 14 bits
10003 2713h, 14 bits
10005 2715h, 14 bits
10007 2717h, 14 bits


can be stored as

10000 2710h, 14 bits
1     1, 3 bits
3     3, 3 bits
5     5, 3 bits
7     7, 3 bits

Title: Re: BT instruction family
Post by: MichaelW on April 12, 2014, 12:30:50 PM

AMD-K5(tm) Processor
9882    cycles for 100 * bt
6703    cycles for 100 * test

9883    cycles for 100 * bt
6687    cycles for 100 * test

9886    cycles for 100 * bt
6685    cycles for 100 * test

15      bytes for bt
16      bytes for test