Hi Gunther,
Thank you for sharing your code. It is an interesting task, and so is its solution :t I had not heard of Chen-Ho before looking at your threads and code.

As for BTC: it could probably be replaced with an XOR with 1 at the bit position to complement, though on modern CPUs there is probably no speed-up from that. The XOR could also be done on a partial register, for example a byte register (BL for RBX, SIL for RSI, etc.). On some CPUs this is sometimes faster than an XOR on the full register (8 bits to process instead of 32, which is probably the reason), but, again, BTC and XOR probably have no speed difference.
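To make the BTC/XOR equivalence concrete, here is a small C model of the two instructions. It is only a sketch: the function names are made up for illustration, and it checks the result value only, not flags or timing.

```c
#include <assert.h>
#include <stdint.h>

/* Model of BTC reg, n: complement bit n (0..63) of x. */
uint64_t btc_model(uint64_t x, unsigned n)
{
    uint64_t mask = UINT64_C(1) << n;
    return (x & mask) ? (x & ~mask) : (x | mask);
}

/* Model of XOR reg, mask with mask = 1 << n: the proposed replacement. */
uint64_t xor_model(uint64_t x, unsigned n)
{
    return x ^ (UINT64_C(1) << n);
}

/* Check that both models agree for every bit position of x. */
void check_btc_equals_xor(uint64_t x)
{
    for (unsigned n = 0; n < 64; ++n)
        assert(btc_model(x, n) == xor_model(x, n));
}
```

Calling check_btc_equals_xor() with a few arbitrary values should pass silently. One thing to keep in mind on the assembly side: XOR on a 64-bit register takes only a sign-extended 32-bit immediate, so the immediate form is a drop-in replacement for bit positions 0 to 30 only; higher bits need the mask in a register.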
I have noticed a couple of places where the code may be changed a bit. The code is already very tightly packed and neat, so there is very little room for optimization :t
The commented-out lines are not removed; the added instructions are written with no tab between the mnemonic and the operands.
Here A&I is precalculated and reused, and each shift by 1 or 2 bits followed by an OR is replaced with a single LEA, using 2 or 4 as the scale and the register that was ORed into as the base. The LEA adds rather than ORs, which gives the same result as long as the target bit in RAX is still clear. Overall saving: 4 instructions. Also, XOR RDX,1 is used instead of BTC RDX,0.
;btc rdx, 0 ; rdx = \A
; register content:
; rax = function result
; rbx, rcx free for calculations
; rdx = \A; rdi = \E; rsi = \I; r15 = K; r14 = J; r13 = I;
; r12 = G; r11 = F; r10 = E; r9 = C; r8 = B; rbp = A
; operators are: '&'=AND; '|'=OR; '\'=NOT
; Bit X = K|(C&I)|(G&A&I)
and rdx,r13 ; rdx = A&I (rdx still holds A here, as the BTC above is commented out)
mov rbx, r9 ; rbx = C
mov rcx, r12 ; rcx = G
and rbx, r13 ; rbx = (C&I)
;and rcx, rbp ; rcx = G&A
or rbx, r15 ; rbx = K|(C&I)
;and rcx, r13 ; rcx = (G&A&I)
and rcx,rdx ; rcx = (G&A&I)
or rbx, rcx ; rbx = K|(C&I)|(G&A&I)
;shl rbx, 1 ; rbx = X
;or rax, rbx ; insert X
lea rax,[rax+rbx*2] ; insert X (shift by 1 and add in one step)
; Bit W = J|(B&I)|(F&A&I)
mov rbx, r8 ; rbx = B
mov rcx, r11 ; rcx = F
and rbx, r13 ; rbx = (B&I)
;and rcx, rbp ; rcx = (F&A)
or rbx, r14 ; rbx = J|(B&I)
;and rcx, r13 ; rcx = (F&A&I)
and rcx,rdx ; rcx = (F&A&I)
or rbx, rcx ; rbx = J|(B&I)|(F&A&I)
;shl rbx, 2 ; rbx = W
;or rax, rbx ; insert W
lea rax,[rax+rbx*4] ; insert W (shift by 2 and add in one step)
; At this point J and K are no longer needed.
; We've now the following register content:
; rax = function result
; rbx, rcx, r15, r14 free for calculations
; rdx = \A; rdi = \E; rsi = \I; r13 = I;
; r12 = G; r11 = F; r10 = E; r9 = C; r8 = B; rbp = A
; Bit U = (A&I)|(C&E& \I)|(G& \E)
;mov rbx, rbp ; rbx = A
mov rcx, r9 ; rcx = C
mov r14, r12 ; r14 = G
;and rbx, r13 ; rbx = (A&I)
and r14, rdi ; r14 = (G& \E)
and rcx, r10 ; rcx = (C&E)
;or rbx, r14 ; rbx = (A&I)|(G& \E)
or rdx,r14 ; rdx = (A&I)|(G& \E)
and rcx, rsi ; rcx = (C&E& \I)
;or rbx, rcx ; rbx = (A&I)|(C&E& \I)|(G& \E)
or rdx,rcx ; rdx = (A&I)|(C&E& \I)|(G& \E)
;shl rbx, 4 ; rbx = U
shl rdx,4 ; rdx = U
;or rax, rbx ; insert U
or rax,rdx ; insert U
mov rdx,rbp ; rdx = A
xor rdx,1 ; rdx = \A (replaces the BTC RDX,0 at the top)
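As a sanity check on the rewrite above, the three formulas can be modelled in C and compared exhaustively. This is only a sketch under assumptions: every input is taken to be a single bit (0 or 1), the result register is assumed to start with bits 1, 2 and 4 clear, and all names here are invented for the model.

```c
#include <stdint.h>

/* Each input is a single bit (0 or 1), named as in the listing comments. */
typedef struct { unsigned A, B, C, E, F, G, I, J, K; } chen_ho_bits;

/* Straight reading of the three formulas, every term computed from scratch. */
uint64_t xwu_direct(chen_ho_bits v)
{
    unsigned nE = v.E ^ 1u, nI = v.I ^ 1u;
    unsigned X = v.K | (v.C & v.I) | (v.G & v.A & v.I);
    unsigned W = v.J | (v.B & v.I) | (v.F & v.A & v.I);
    unsigned U = (v.A & v.I) | (v.C & v.E & nI) | (v.G & nE);
    return ((uint64_t)X << 1) | ((uint64_t)W << 2) | ((uint64_t)U << 4);
}

/* Same bits with A&I computed once; add-with-scale stands in for LEA. */
uint64_t xwu_shared(chen_ho_bits v)
{
    unsigned nE = v.E ^ 1u, nI = v.I ^ 1u;
    unsigned AI = v.A & v.I;                          /* and rdx,r13 */
    uint64_t r = 0;                  /* assumption: target bits start clear */
    r = r + 2u * (v.K | (v.C & v.I) | (v.G & AI));    /* lea ...*2 : X */
    r = r + 4u * (v.J | (v.B & v.I) | (v.F & AI));    /* lea ...*4 : W */
    r |= (uint64_t)(AI | (v.G & nE) | (v.C & v.E & nI)) << 4;   /* U */
    return r;
}

/* Try all 2^9 input combinations and count disagreements. */
unsigned count_mismatches(void)
{
    unsigned bad = 0;
    for (unsigned m = 0; m < 512u; ++m) {
        chen_ho_bits v = { m & 1u, (m >> 1) & 1u, (m >> 2) & 1u,
                           (m >> 3) & 1u, (m >> 4) & 1u, (m >> 5) & 1u,
                           (m >> 6) & 1u, (m >> 7) & 1u, (m >> 8) & 1u };
        if (xwu_direct(v) != xwu_shared(v))
            ++bad;
    }
    return bad;
}
```

If the rewrite is sound, count_mismatches() returns 0. This checks the logic only; it says nothing about register allocation or speed.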
We can also combine the shift and the OR wherever the shift amount is 3 or less (LEA scale factors are 2, 4 and 8), as in the code below:
and rbx, IVMASK ; isolate V
; shl rbx, 1 ; rbx = H
; or rax, rbx ; insert H
lea rax,[rax+rbx*2]
There are 5 such places in Ch2Bcd, so that saves 5 instructions.
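The one thing to watch with the LEA form is that it adds instead of ORing, so it is only a safe substitute when the target bit is not already set. A small C sketch of that condition follows; the function names are hypothetical and k is the shift amount, 0 to 3.

```c
#include <stdint.h>

/* shl rbx,k ; or rax,rbx */
uint64_t insert_shl_or(uint64_t acc, uint64_t bit, unsigned k)
{
    return acc | (bit << k);
}

/* lea rax,[rax+rbx*scale] with scale = 1 << k, k in 0..3 */
uint64_t insert_lea(uint64_t acc, uint64_t bit, unsigned k)
{
    return acc + bit * (UINT64_C(1) << k);
}

/* The two forms agree when the shifted value shares no set bit with acc. */
int lea_is_safe(uint64_t acc, uint64_t bit, unsigned k)
{
    return (acc & (bit << k)) == 0;
}
```

For example, with acc = 0x10, bit = 1, k = 1 both forms give 0x12. With acc = 0x2 the OR form leaves 0x2 while the LEA form carries into the next bit and gives 0x4, so each of the 5 places is worth a quick look to confirm the target bit is still clear.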
There may also be other places where an intermediate result, like A&I in the first example above, can be reused to avoid recalculation, but the code is very neat (and very well commented!) and hard to optimize further :t