Hello everyone,
This is my first post here, so please forgive my English.
I am seeing strange behavior in some assembly code that I can't figure out, and I would appreciate your help.
I attached a simple Delphi project that demonstrates this behavior. I am sure most of you can read the code in question without a problem; I had difficulty converting it into a MASM project.
The unexplained behavior is in the execution speed of the AES-NI instructions. As you can see, only two functions are left, and they do the actual AES encryption in CBC and CTR mode. Both are logically correct. I stripped key expansion and decryption from the project, because I noticed the behavior only in the CBC encryption function:
const
ONE_LE: array[0..15] of Byte = (0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1);
ONE_BE: array[0..15] of Byte = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0);
BSWAP_EPI64: array[0..15] of Byte = (15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4,
3, 2, 1, 0);
procedure EncryptAES_CTR_AESNI_BE(const Key, Input, Output: Pointer; const
Length, NRounds: Integer; const IV: Pointer);
asm
SHR length, $4
cmp length, 0
jnz @NO_PARTS
add length, 01
@NO_PARTS:
push Input
mov Input, [IV]
movdqu xmm3, [Input]
Pop Input
movdqu xmm2, [BSWAP_EPI64]
movdqu xmm4, [ONE_BE]
{cmp NRounds, 12
jz @LOOP192
cmp NRounds, 14
jz @LOOP256 }
@LOOP128:
movdqa xmm1, xmm3
/////////
pxor xmm1, [Key] + $00
aesenc xmm1, [Key] + $10
aesenc xmm1, [Key] + $20
aesenc xmm1, [Key] + $30
aesenc xmm1, [Key] + $40
aesenc xmm1, [Key] + $50
aesenc xmm1, [Key] + $60
aesenc xmm1, [Key] + $70
aesenc xmm1, [Key] + $80
aesenc xmm1, [Key] + $90
aesenclast xmm1, [Key] + $A0
/////////
movdqu xmm0, [Input]
pxor xmm1, xmm0
movdqu [Output], xmm1
//
pshufb xmm3, xmm2 // BigEndian inc Counter by 1
paddq xmm3, xmm4
pshufb xmm3, xmm2
//
add Output, $10
add Input, $10
dec length
jne @LOOP128
@DONE:
mov Input, [IV]
movdqu [Input], xmm3 // update IV
//EMMS
end;
procedure EncryptAES_CBC_AESNI(const Key, Input, Output: Pointer; const Length,
NRounds: Integer; const IV: Pointer);
asm
SHR length, $4
cmp length, 0
jnz @NO_PARTS
add length, 01
@NO_PARTS:
sub Output, $10
push Input
mov Input, [IV]
movdqu xmm1, [Input]
Pop Input
@LOOP128:
//pxor xmm1,xmm1 /// <------ ???!!!!
movdqu xmm0, [Input]
pxor xmm1, xmm0
/////////
pxor xmm1, [Key] + $00
aesenc xmm1, [Key] + $10
aesenc xmm1, [Key] + $20
aesenc xmm1, [Key] + $30
aesenc xmm1, [Key] + $40
aesenc xmm1, [Key] + $50
aesenc xmm1, [Key] + $60
aesenc xmm1, [Key] + $70
aesenc xmm1, [Key] + $80
aesenc xmm1, [Key] + $90
aesenclast xmm1, [Key] + $A0
/////////
add Output, $10
add Input, $10
dec length
movdqu [Output], xmm1
jne @LOOP128
jmp @DONE
@DONE:
mov Input, [IV] // update IV
movdqu [Input], xmm1
//emms
end;
Building and running the attached application gives me this result:
AESNI CBC 128 loop:100 buff:16777216 Size: 1677721600 ... Duration: 2446.79783139908 ms
AESNI CTR 128 loop:100 buff:16777216 Size: 1677721600 ... Duration: 858.569035596996 ms
Done..
Now, if I uncomment //pxor xmm1,xmm1 in EncryptAES_CBC_AESNI, the result becomes:
AESNI CBC 128 loop:100 buff:16777216 Size: 1677721600 ... Duration: 763.221140856994 ms
AESNI CTR 128 loop:100 buff:16777216 Size: 1677721600 ... Duration: 839.073241313889 ms
Done..
Why?
Why is the code almost 3 times faster when xmm1 is cleared before the actual encryption? Even if I reduce that loop to the following, with the pxor still commented out and no loads or stores at all:
@LOOP128:
//pxor xmm1,xmm1 /// <------ ???!!!!
pxor xmm1, [Key] + $00
aesenc xmm1, [Key] + $10
aesenc xmm1, [Key] + $20
aesenc xmm1, [Key] + $30
aesenc xmm1, [Key] + $40
aesenc xmm1, [Key] + $50
aesenc xmm1, [Key] + $60
aesenc xmm1, [Key] + $70
aesenc xmm1, [Key] + $80
aesenc xmm1, [Key] + $90
aesenclast xmm1, [Key] + $A0
dec length
jne @LOOP128
the result is still:
AESNI CBC 128 loop:100 buff:16777216 Size: 1677721600 ... Duration: 2416.72080163786 ms
AESNI CTR 128 loop:100 buff:16777216 Size: 1677721600 ... Duration: 867.53402698143 ms
Done..
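To put the three CBC timings on a common scale, here is a minimal sketch that converts a reported duration into throughput. It is not part of the attached project: the names ThroughputSketch, TotalBytes and MiBPerSec are made up for illustration, and the only inputs are the Size and Duration values printed above.

```pascal
program ThroughputSketch;
{$APPTYPE CONSOLE}

const
  // "Size" from the output above: 100 loops x 16777216 bytes.
  TotalBytes = 1677721600.0;

// Hypothetical helper: converts a duration in milliseconds
// into throughput in MiB per second for TotalBytes of data.
function MiBPerSec(const DurationMs: Double): Double;
begin
  Result := (TotalBytes / (1024 * 1024)) / (DurationMs / 1000);
end;

begin
  // CBC as posted (pxor xmm1,xmm1 commented out)
  Writeln('CBC, as posted:        ', MiBPerSec(2446.79783139908):0:1, ' MiB/s');
  // CBC with pxor xmm1,xmm1 uncommented
  Writeln('CBC, xmm1 cleared:     ', MiBPerSec(763.221140856994):0:1, ' MiB/s');
  // Reduced loop: only the pxor/aesenc sequence, no loads or stores
  Writeln('CBC, reduced loop:     ', MiBPerSec(2416.72080163786):0:1, ' MiB/s');
  // Ratio between the slow and the fast variant
  Writeln('slow / fast duration:  ', (2446.79783139908 / 763.221140856994):0:2);
end.
```

The reduced loop lands very close to the full CBC loop, so the loads and stores do not seem to account for the difference.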
I tried aligning the code, and even the loop itself (I might have done it wrong, though), without any success in identifying the cause or explaining the results.
So what is triggering this bizarre slowness?
And how can I predict it when I don't have anything to compare against? (Suggestions or resources to read and learn from are welcome.)
Thank you in advance