Chen-Ho compression

Gunther · April 10, 2014, 09:28:58 PM

Attached is the archive ChenHo.zip. The starting point was this Campus thread a few days ago.

The archive contains all the necessary files for building the running EXE. There's a subdirectory for jWasm and MASM and another for NASM and YASM with the appropriate sources. GCC is the C compiler, but the source should compile with VS or Pelle's C, too (not tested). Could anybody help testing that?

A more or less detailed description of the used algorithm can be found here. The algorithm needs no multiplication or division; only logical operations and shifts are necessary. All data can be hold inside the CPU. There are two procedures:

The compression from BCD to Chen-Ho.
The Expansion from Chen-Ho to BCD.

These are approximately 400 lines of code. But there are no loops or jumps, just simple and straight forward AND, OR + shifting inside the processor. The prefetch queue stays healthy over the entire procedure. Most of the instructions (not all) are pairable; so the compression and expansion are very fast.

The YASM and jWasm sources are similar. But there is one big difference by moving a general register into a XMM register and vice versa. Please check for details the interesting discussion here.

The including EXE brings the following output:

Code Select


BCD to Chen-Ho compression:
---------------------------

BCD value     = 923
Chen-Ho value = 253

Chen-Ho to BCD expansion:
-------------------------

Chen-Ho value = 253
BCD value     = 923

Please press a key to end the program ...

Help and suggestions for improvements are welcome.

Gunther

GoneFishing · April 11, 2014, 12:17:04 AM

Hi Gunther,
MS VC 2010 successfully built it after few modifications (inttypes.h was not found , so I replaced its functionality):

Quote
#include <stdio.h>
#include <stdlib.h>

extern unsigned long long int Bcd2Ch(unsigned long long int value);
extern unsigned long long int Ch2Bcd(unsigned long long int value);

int main(int argc, char *argv[])
{
unsigned long long int BCD1 = 0x0000000000000923;
unsigned long long int BCD2 = 0x0000000000000000;
unsigned long long int CH = 0x0000000000000000;
char c;

// Compress BCD to Chen-Ho and display the result.

CH = Bcd2Ch(BCD1);
printf("BCD to Chen-Ho compression:\n");
printf("---------------------------\n\n");
printf("BCD value = %x\n", BCD1);
printf("Chen-Ho value = %x\n\n", CH);

// Expand Chen-Ho to BCD and display the result.

BCD2 = Ch2Bcd(CH);
printf("Chen-Ho to BCD expansion:\n");
printf("-------------------------\n\n");
printf("Chen-Ho value = %x\n", CH);
printf("BCD value = %x\n\n", BCD2);
printf("Please press a key to end the program ...\n");
c = getchar();
return 0;
}

Gunther · April 11, 2014, 02:29:35 AM

Hi vertograd,

Quote from: vertograd on April 11, 2014, 12:17:04 AM
Hi Gunther,
MS VC 2010 successfully built it after few modifications (inttypes.h was not found , so I replaced its functionality):

thank you. :t inttypes.h is part of C99; an alternative could be that. If that works, we could include it via macro or so - just an idea.

Gunther

Gunther · April 11, 2014, 02:42:08 AM

I've found that: inttypes.h is now part of VS 2013. Details can be found here.

Gunther

Antariy · April 13, 2014, 05:55:55 AM

Hi Gunther,

Thank you for sharing your code - it is interesting task and it's solution :t I did not hear about Chen-Ho before looking at your threads and code

As about BTC - probably it may be replace with XOR with 1 at the bit position to complement, but probably with modern CPUs there are no speed up with that. Also it maybe XORed with partial reg - for example byte reg (BL for RBX SL for RSI etc), sometimes for some CPUs this is faster that XOR with full reg (8 bits vs 32 bits to process, probably there is the reason for that), but, again, probably BTC and XOR have no speed difference.

I have noticed a couple of places where the code maybe changed a bit. The code is already very highly packed and neat so there are very little place for optimization :t

The commented lines are not removed, added instructions are with no tabulation between instruction opcode and registers.

Here are A&I is precalculated and reused, also shift for 1 and 2 bits and OR is combined with LEA with 2 and 4 as scale + offset in register which was ORed. Overall economy 4 instructions. Also XOR RDX,1 is used instead BTC RDX,0.

Code Select


        ;btc        rdx, 0                ; rdx = \A

        ; register content:
        ; rax = function result
        ; rbx, rcx free for calculations
        ; rdx = \A; rdi = \E; rsi = \I; r15 = K; r14 = J; r13 = I;
        ; r12 = G; r11 = F; r10 = E; r9 = C; r8 = B; rbp = A
        ; operators are: '&'=AND; '|'=OR; '\'=NOT 

        ; Bit X = K|(C&I)|(G&A&I)

        and rdx,r13	; A&I

        mov        rbx, r9               ; rbx = C
        mov        rcx, r12              ; rcx = G
        and        rbx, r13              ; rbx = (C&I)
        ;and        rcx, rbp              ; rcx = G&A
        or         rbx, r15              ; rbx = K|(C&I)
        ;and        rcx, r13              ; rcx = (G&A&I)
        and rcx,edx
        or         rbx, rcx              ; rbx = K|(C&I)|(G&A&I)
        ;shl        rbx, 1                ; rbx = X
        ;or         rax, rbx              ; insert X
        lea eax,[eax+ebx*2]

        ; Bit W = J|(B&I)|(F&A&I)

        mov        rbx, r8               ; rbx = B
        mov        rcx, r11              ; rcx = F
        and        rbx, r13              ; rbx = (B&I)
        ;and        rcx, rbp              ; rcx = (F&A)
        or         rbx, r14              ; rbx = J|(B&I)
        ;and        rcx, r13              ; rcx = (F&A&I)
        and rcx,edx
        or         rbx, rcx              ; rbx = J|(B&I)|(F&A&I)
        ;shl        rbx, 2                ; rbx = W
        ;or         rax, rbx              ; insert W
        lea eax,[eax+ebx*4]

        ; At this point J and K ar no longer needed.
        ; We've now the following register content:
        ; rax = function result
        ; rbx, rcx, r15, r14 free for calculations
        ; rdx = \A; rdi = \E; rsi = \I; r13 = I;
        ; r12 = G; r11 = F; r10 = E; r9 = C; r8 = B; rbp = A

        ; Bit U = (A&I)|(C&E& \I)|(G& \E)

        ;mov        rbx, rbp              ; rbx = A
        mov        rcx, r9               ; rcx = C
        mov        r14, r12              ; r14 = G
        ;and        rbx, r13              ; rbx = (A&I)
        and        r14, rdi              ; r14 = (G& \E)
        and        rcx, r10              ; rcx = (C&E)
        ;or         rbx, r14              ; rbx = (A&I)|(G& \E)
        or rdx,r14
        and        rcx, rsi              ; rcx = (C&E& \I)
        ;or         rbx, rcx              ; rbx = (A&I)|(C&E& \I)|(G& \E)
        or rdx,rcx
        ;shl        rbx, 4                ; rbx = U
        shl rdx,4
        ;or         rax, rbx              ; insert U
        or rax,rdx
        
	mov rdx,rbp
	xor rdx,1

Also we can combine ORing and shifting where the shifting amount is smaller than or equal to 3, like code below:

Code Select


        and        rbx, IVMASK           ; isolate V
;        shl        rbx, 1                ; rbx = H
;        or         rax, rbx              ; insert H
        lea rax,[rax+rbx*2]

There are 5 such places in Ch2Bcd so it's 5 instructions economy.
Also maybe some other places, like A&I in the first example above, maybe reused to avoid re-calculation, but the code seems to be very neat (and very well commented!) and hard to be optimized further :t

Gunther · April 13, 2014, 08:33:57 PM

Alex,

thank you for checking the code and for your suggestions. I'll check the material and probably consider your remarks.

Gunther

Antariy · April 14, 2014, 01:23:38 AM

You are welcome, Gunther!

I did not test the example since my system is 32 bit.
I mistyped in the first code block - there is and rcx,Edx instead of Rdx.

Also maybe it more effective in therms of speed and code size to use 32 bit regs where it's possible instead of 64 bit regs which require extended regs prefix? Probably it may save some useful cycles and bytes over the code when ANDs and shifts are doing on 32 bit regs.

Gunther · April 14, 2014, 01:33:29 AM

Alex,

Quote from: Antariy on April 14, 2014, 01:23:38 AM
I did not test the example since my system is 32 bit.

time to change, my friend. 8)

Quote from: Antariy on April 14, 2014, 01:23:38 AM
Also maybe it more effective in therms of speed and code size to use 32 bit regs where it's possible instead of 64 bit regs which require extended regs prefix? Probably it may save some useful cycles and bytes over the code when ANDs and shifts are doing on 32 bit regs.

I'll think about that point. Thank you.

Gunther

The MASM Forum

News:

Chen-Ho compression

Gunther

GoneFishing

Gunther

Gunther

Antariy

Gunther

Antariy

Gunther