I'm fighting at the moment with the Chen-Ho encoding. It's not as hard and complicated as it sounds, but it involves a lot of bit fumbling. My plan is to use the BT instruction family (BT, BTC, BTR, BTS). Is that a good idea? I've read that those instructions tend to be slow. Does anyone have experience with them?
Gunther
Gunther,
I usually try and use AND masking instead of bit instructions. TEST is also useful in some bit contexts.
What about a LUT? 2^12 * 2 bytes = 8 KB is nothing much.
LUT sounds good. Why don't you go straight for DPD (http://speleotrove.com/decimal/DPDecimal.html)?
Steve,
Quote from: hutch-- on April 06, 2014, 02:38:53 AM
I usually try and use AND masking instead of bit instructions. TEST is also useful in some bit contexts.
the trick is that some bits must be complemented. Therefore I think BTC is not so bad.
qWord,
Quote from: qWord on April 06, 2014, 02:54:23 AM
What about a LUT? 2^12 * 2 bytes = 8 KB is nothing much.
yes, a LUT can speed things up, I know. On the other hand, no multiplications or divisions are necessary for encoding to or decoding from BCD, only logical instructions. Under 64-bit Windows and Linux we can hold all values inside the CPU registers. That's a definite advantage.
Jochen,
Quote from: jj2007 on April 06, 2014, 03:12:30 AM
LUT sounds good. Why don't you go straight for DPD (http://speleotrove.com/decimal/DPDecimal.html)?
Densely Packed Decimal encoding was invented in 2002 and is a refinement of Chen-Ho. I need such an encoding scheme for a special purpose and I'm not sure yet which one to use.
Gunther
I did a test of BT a while back and the timing results were something like 4-5 cycles on a P3 and ~10 cycles on a P4 Northwood. If you need the full functionality, including the modulo 16, 32, or 64 of the bit-offset operand, I can't see any way to go faster.
Michael,
Quote from: MichaelW on April 06, 2014, 04:11:29 AM
I did a test of BT a while back and the timing results were something like 4-5 cycles on a P3 and ~10 cycles on a P4 Northwood. If you need the full functionality, including the modulo 16, 32, or 64 of the bit-offset operand, I can't see any way to go faster.
thank you for the information. I'll try the BTC instruction and I'll see if it fits my needs.
Gunther
I thought test was faster than bt, but on my CPU, they perform identically:
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
11161 cycles for 100 * bt
11155 cycles for 100 * test
Jochen,
thank you for the test bed. Here are the results:
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
7433 cycles for 100 * bt
7453 cycles for 100 * test
6849 cycles for 100 * bt
6830 cycles for 100 * test
6833 cycles for 100 * bt
6822 cycles for 100 * test
15 bytes for bt
16 bytes for test
--- ok ---
The same results here. On the other hand, the test bed isn't realistic enough:
btc eax, 7
selects bit 7 in EAX, stores the value of that bit in the CF flag, and complements bit 7 in EAX, while TEST only computes a bit-wise logical AND and sets the flags. So for a fair comparison we have to add the time to complement the appropriate bit.
Gunther
prescott w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
27368 cycles for 100 * bt
27322 cycles for 100 * test
27366 cycles for 100 * bt
27316 cycles for 100 * test
27329 cycles for 100 * bt
27284 cycles for 100 * test
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)
7853 cycles for 100 * bt
7854 cycles for 100 * test
7853 cycles for 100 * bt
7855 cycles for 100 * test
7853 cycles for 100 * bt
7853 cycles for 100 * test
15 bytes for bt
16 bytes for test
Quote from: Gunther on April 06, 2014, 09:02:05 PM
the test bed isn't realistic enough:
btc eax, 7
selects bit 7 in EAX, stores the value of that bit in the CF flag, and complements bit 7 in EAX, while TEST only computes a bit-wise logical AND and sets the flags. So for a fair comparison we have to add the time to complement the appropriate bit.
Just delete the c in btc and test again.
Jochen,
Quote from: jj2007 on April 07, 2014, 01:12:52 AM
Just delete the c in btc and test again.
that's not the point. I don't think the time differences are dramatic. But BTC is exactly the instruction I need.
Gunther
Hi,
Ran the test on some older computers. On the oldest, the
bit test instruction was slower. Though not horribly so.
Regards,
Steve N.
pre-P4 (SSE1)
11835 cycles for 100 * bt
11823 cycles for 100 * test
11831 cycles for 100 * bt
11832 cycles for 100 * test
11832 cycles for 100 * bt
11823 cycles for 100 * test
15 bytes for bt
16 bytes for test
--- ok ---
pre-P4
28376 cycles for 100 * bt
11188 cycles for 100 * test
28371 cycles for 100 * bt
11202 cycles for 100 * test
28371 cycles for 100 * bt
11195 cycles for 100 * test
15 bytes for bt
16 bytes for test
--- ok ---
pre-P4
11878 cycles for 100 * bt
11947 cycles for 100 * test
11936 cycles for 100 * bt
11929 cycles for 100 * test
11955 cycles for 100 * bt
11920 cycles for 100 * test
15 bytes for bt
16 bytes for test
--- ok ---
Hi Steve,
Quote from: FORTRANS on April 07, 2014, 06:30:13 AM
Ran the test on some older computers. On the oldest, the
bit test instruction was slower. Though not horribly so.
interesting results. The second pre-P4 looks very strange.
Gunther
Gunther,
If you need to use the BT family of instructions, and you carefully select the instructions around them, you can probably get a few of the following instructions to execute in any stall hole they may leave.
Steve,
Quote from: hutch-- on April 07, 2014, 11:15:18 AM
If you need to use the BT family of instructions, and you carefully select the instructions around them, you can probably get a few of the following instructions to execute in any stall hole they may leave.
that's my plan. I've finished the BCD to Chen-Ho compression. It's a bit lengthy, but fast. No data cache or RAM accesses are necessary; all values stay inside registers. The code contains no loops or jumps, only AND, OR, BTC and MOV on a register-to-register basis, so the prefetch queue stays healthy over the entire procedure. I think that's a good starting point.
I now have to write the Chen-Ho to BCD expansion. That should work in a similar way. If it works, I'll open a new thread with a test program. I need all this for a special encoding question in Geneva.
Gunther
Hi Gunther,
Quote from: Gunther on April 07, 2014, 08:20:58 AM
interesting results. The second pre-P4 looks very strange.
Oh? Well, it is a Pentium MMX in a laptop. Looking at some old
opcode timings, the bit test instructions are a bit slow compared
to logical or arithmetic operations, but not too bad. The bit scan
instructions do look rather bad for the 386/486/Pentium. Maybe that
is where the bit test operations got their reputation for being slow.
The other two processors were a P-III and a P-II.
By the way, using the Chen-Ho encoding (or DPD) seems
somewhat complex when compared to BCD or binary. (At first
glance to me anyway.) When you finish coding your project I
would be interested in a timing comparison of Chen-Ho, BCD,
and binary processing if that is possible. Idle curiosity, so don't
do it if it does not interest you.
Regards,
Steve N.
Hi Steve,
Quote from: FORTRANS on April 07, 2014, 11:56:52 PM
By the way, using the Chen-Ho encoding (or DPD) seems
somewhat complex when compared to BCD or binary. (At first
glance to me anyway.) When you finish coding your project I
would be interested in a timing comparison of Chen-Ho, BCD,
and binary processing if that is possible. Idle curiosity, so don't
do it if it does not interest you.
you're right. The processing times rank Chen-Ho > BCD > binary. We can test it, but that will be the result. Chen-Ho and DPD are only of interest to save bandwidth for temporary storage.
Let me give you an example. For some image compression techniques it's a good idea to save some transformation parameters, like scale and offset. Those parameters are stored in 12 bits each, 24 bits together. Chen-Ho or DPD can store 12 bits in 10 bits without information loss. That saves 400 bits for 100 transformations, and large images often have several thousand such transformations.
Gunther
Hi Gunther,
Well, Chen-Ho will save bytes over BCD, but not over binary. So
I guess my question would be: why do you need to use decimal digits
in your code? Or maybe I have missed something.
Regards,
Steve N.
Steve,
Quote from: FORTRANS on April 08, 2014, 08:07:47 AM
Well Chen-Ho will save bytes over BCD, but not over binary. So
I guess my question would be, why do you need to use decimal digits
in your code. Or maybe I have missed something.
careful: Chen-Ho and DPD have the same efficiency as the binary representation. For example, the decimal value 999 has the binary representation 1111100111. The Chen-Ho representation is 111 111 1001, and as a DPD value it would be 001 111 1111.
I wrote in reply #18 about scale and offset. The offset isn't the problem, because those are integer values. But the scale is often a fractional decimal value, and that's a real problem because of rounding errors. With BCD this can be avoided, but BCD is not very space efficient. That's the reason for DPD or Chen-Ho.
For a more technical interpretation: The scale is the contrast and the offset is the brightness of an image region.
Gunther
Hi Gunther,
Okay. Thank you for the explanation. I will have to consider
it to see if I see the practicality of it.
Quote: Chen-Ho and DPD have the same efficiency as the binary representation.
An interesting observation on its own. Though it's 999 versus
1024 if decimal digits are not required. I suppose I should
investigate the subject further.
Regards,
Steve N.
let's say you have 5 values:
10000 = 2710h, 14 bits
10001 = 2711h, 14 bits
10003 = 2713h, 14 bits
10005 = 2715h, 14 bits
10007 = 2717h, 14 bits
They can be stored as:
10000 = 2710h, 14 bits (base)
+1, 3 bits
+3, 3 bits
+5, 3 bits
+7, 3 bits
AMD-K5(tm) Processor
9882 cycles for 100 * bt
6703 cycles for 100 * test
9883 cycles for 100 * bt
6687 cycles for 100 * test
9886 cycles for 100 * bt
6685 cycles for 100 * test
15 bytes for bt
16 bytes for test