AES-NI timings

Started by Kas, June 06, 2019, 01:03:52 AM


Kas

Hello everyone,

This is my first post here, so please forgive my English.

I am seeing strange behavior in some assembly code that I can't figure out, and I would appreciate your help.
I attached a simple Delphi project that demonstrates it. I had difficulty converting it into a MASM project, but I am sure most of you can read the code in question without a problem. The unexplained behavior is in the execution speed of the AES-NI instructions. Two functions are left that do the actual AES encryption, in CBC and CTR mode, and both are logically correct. The project has been stripped of key expansion and decryption, because I noticed the behavior only in the CBC encryption function:

const
  ONE_LE: array[0..15] of Byte = (0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1);
  ONE_BE: array[0..15] of Byte = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0);
  BSWAP_EPI64: array[0..15] of Byte = (15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4,
    3, 2, 1, 0);

procedure EncryptAES_CTR_AESNI_BE(const Key, Input, Output: Pointer; const
  Length, NRounds: Integer; const IV: Pointer);
asm
        SHR     length, $4
        cmp     length, 0
        jnz     @NO_PARTS
        add     length, 01

@NO_PARTS:
        push    Input
        mov     Input, [IV]
        movdqu  xmm3, [Input]
        Pop     Input
        movdqu  xmm2, [BSWAP_EPI64]
        movdqu  xmm4, [ONE_BE]
        {cmp     NRounds, 12
        jz      @LOOP192
        cmp     NRounds, 14
        jz      @LOOP256   }
@LOOP128:
        movdqa  xmm1, xmm3
        /////////
        pxor    xmm1, [Key] + $00
        aesenc  xmm1, [Key] + $10
        aesenc  xmm1, [Key] + $20
        aesenc  xmm1, [Key] + $30
        aesenc  xmm1, [Key] + $40
        aesenc  xmm1, [Key] + $50
        aesenc  xmm1, [Key] + $60
        aesenc  xmm1, [Key] + $70
        aesenc  xmm1, [Key] + $80
        aesenc  xmm1, [Key] + $90
        aesenclast xmm1, [Key] + $A0
        /////////
        movdqu  xmm0, [Input]
        pxor    xmm1, xmm0
        movdqu  [Output], xmm1
        //
        pshufb  xmm3, xmm2             // BigEndian  inc Counter by 1
        paddq   xmm3, xmm4
        pshufb  xmm3, xmm2
        //
        add     Output, $10
        add     Input, $10
        dec     length
        jne     @LOOP128

@DONE:
        mov     Input, [IV]
        movdqu  [Input], xmm3           //  update IV
        //EMMS
end;

procedure EncryptAES_CBC_AESNI(const Key, Input, Output: Pointer; const Length,
  NRounds: Integer; const IV: Pointer);
asm
        SHR     length, $4
        cmp     length, 0
        jnz     @NO_PARTS
        add     length, 01

@NO_PARTS:
        sub     Output, $10
        push    Input
        mov     Input, [IV]
        movdqu  xmm1, [Input]
        Pop     Input

@LOOP128:
        //pxor xmm1,xmm1            ///  <------ ???!!!!
        movdqu  xmm0, [Input]
        pxor    xmm1, xmm0
        /////////
        pxor    xmm1, [Key] + $00
        aesenc  xmm1, [Key] + $10
        aesenc  xmm1, [Key] + $20
        aesenc  xmm1, [Key] + $30
        aesenc  xmm1, [Key] + $40
        aesenc  xmm1, [Key] + $50
        aesenc  xmm1, [Key] + $60
        aesenc  xmm1, [Key] + $70
        aesenc  xmm1, [Key] + $80
        aesenc  xmm1, [Key] + $90
        aesenclast xmm1, [Key] + $A0
        /////////
        add     Output, $10
        add     Input, $10
        dec     length
        movdqu  [Output], xmm1
        jne     @LOOP128
        jmp     @DONE

@DONE:
        mov     Input, [IV]            //  update IV
        movdqu  [Input], xmm1
        //emms
end;
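
As a side note for readers less used to SSE shuffles, the counter update in EncryptAES_CTR_AESNI_BE (pshufb, paddq, pshufb) can be modelled in a few lines of Python. This is only a sketch of what the listing appears to do; the function name ctr_inc_be64 is made up for illustration. The BSWAP_EPI64 mask reverses all 16 bytes and paddq with ONE_BE adds 1 to the low quadword only, so the net effect should be to increment bytes 8..15 as one big-endian 64-bit number, with no carry into bytes 0..7.

```python
def ctr_inc_be64(counter: bytes) -> bytes:
    """Model of the pshufb / paddq / pshufb sequence on a 16-byte counter block."""
    if len(counter) != 16:
        raise ValueError("counter block must be 16 bytes")
    # pshufb with BSWAP_EPI64: reverse all 16 bytes
    rev = counter[::-1]
    # paddq with ONE_BE: add 1 to the low quadword only, wrapping modulo 2^64
    low = (int.from_bytes(rev[:8], "little") + 1) & 0xFFFFFFFFFFFFFFFF
    rev = low.to_bytes(8, "little") + rev[8:]
    # second pshufb: reverse back to the original byte order
    return rev[::-1]


if __name__ == "__main__":
    # hypothetical counter value whose last byte is about to overflow
    print(ctr_inc_be64(bytes(15) + b"\xff").hex())
```

If that reading is right, a counter whose last 8 bytes are all $FF wraps to zero in those bytes and leaves the first 8 bytes untouched.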


Building and running the attached application gives me this result:

AESNI   CBC 128 loop:100 buff:16777216  Size: 1677721600         ... Duration: 2446.79783139908 ms
AESNI   CTR 128 loop:100 buff:16777216  Size: 1677721600         ... Duration: 858.569035596996 ms
Done..

Now if I un-comment //pxor xmm1,xmm1 in EncryptAES_CBC_AESNI, the result becomes:

AESNI   CBC 128 loop:100 buff:16777216  Size: 1677721600         ... Duration: 763.221140856994 ms
AESNI   CTR 128 loop:100 buff:16777216  Size: 1677721600         ... Duration: 839.073241313889 ms
Done..
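
To put the two CBC timings above in perspective, here is a quick calculation of throughput and speedup. It uses only the Size and Duration values printed by the test application; the helper name throughput_mib_s is just for illustration.

```python
SIZE_BYTES = 1677721600  # "Size" from the output: 100 loops x 16777216 bytes


def throughput_mib_s(duration_ms: float, size_bytes: int = SIZE_BYTES) -> float:
    """Throughput in MiB/s for a run that processed size_bytes in duration_ms."""
    return size_bytes / (1024 * 1024) / (duration_ms / 1000.0)


CBC_PLAIN_MS = 2446.79783139908  # CBC, original loop
CBC_PXOR_MS = 763.221140856994   # CBC, with pxor xmm1,xmm1 un-commented

if __name__ == "__main__":
    print(round(throughput_mib_s(CBC_PLAIN_MS)))  # about 654 MiB/s
    print(round(throughput_mib_s(CBC_PXOR_MS)))   # about 2096 MiB/s
    print(round(CBC_PLAIN_MS / CBC_PXOR_MS, 2))   # about 3.21x faster
```

So the single added instruction takes the CBC loop from roughly 654 MiB/s to roughly 2096 MiB/s, a factor of about 3.2.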

Why?
Why does clearing xmm1 before the actual encryption make it almost 3 times faster? Even if I change the loop to

@LOOP128:
        //pxor xmm1,xmm1            ///  <------ ???!!!!
        pxor    xmm1, [Key] + $00
        aesenc  xmm1, [Key] + $10
        aesenc  xmm1, [Key] + $20
        aesenc  xmm1, [Key] + $30
        aesenc  xmm1, [Key] + $40
        aesenc  xmm1, [Key] + $50
        aesenc  xmm1, [Key] + $60
        aesenc  xmm1, [Key] + $70
        aesenc  xmm1, [Key] + $80
        aesenc  xmm1, [Key] + $90
        aesenclast xmm1, [Key] + $A0
        dec     length
        jne     @LOOP128


the result is still

AESNI   CBC 128 loop:100 buff:16777216  Size: 1677721600         ... Duration: 2416.72080163786 ms
AESNI   CTR 128 loop:100 buff:16777216  Size: 1677721600         ... Duration: 867.53402698143 ms
Done..

I tried aligning the code, and even the loop itself (I might have done it wrong), without any success in identifying the cause or explaining the results.

So what is triggering this bizarre slowness? And
how can I predict it when I don't have something to compare against? (Suggestions or resources to read and learn from are welcome.)


Thank you in advance

nidud

#1
deleted

Kas

Thank you nidud,

I just got my answer that solved and explained this behavior.
The behavior is due to the CPU's out-of-order execution. Without pxor, each iteration starts from the xmm1 value produced by the previous one, so the CPU has to wait for the aesenclast result before it can begin the next iteration. With pxor xmm1,xmm1 that dependency is broken, and the CPU can execute several iterations in parallel even though they all use xmm1. With this loop my CPU managed to execute 3 iterations in parallel while your CPU executed 4.

The thing is, without pxor xmm1,xmm1 the CPU runs at normal speed, and with it the code goes faster. I was looking for the opposite, thinking something was causing a slowdown, when it was pxor that triggered the extra speed boost.
I wasn't expecting this boost at all for relatively long code; I thought out-of-order execution only happened across a few short instructions.
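
The dependency itself can be shown with a small Python model. This is only a sketch: toy_block is a made-up keyed mixing function standing in for the aesenc chain (it is NOT AES), and the function names are hypothetical. The point is the data flow. In the chained loop every block needs the previous result, while resetting the state first (the pxor xmm1,xmm1 case) makes every block depend on its own input only, which is what lets the CPU overlap iterations. As far as I can tell it also means the fast variant no longer produces CBC output, since the chaining and the IV are thrown away.

```python
MASK64 = (1 << 64) - 1


def toy_block(x: int, key: int) -> int:
    """Stand-in for the aesenc chain (NOT AES): a simple keyed 64-bit mix."""
    x ^= key
    x = (x * 0x9E3779B97F4A7C15) & MASK64
    return x ^ (x >> 29)


def cbc_chained(blocks, key, iv):
    """Each output feeds the next iteration: a loop-carried dependency."""
    state, out = iv, []
    for b in blocks:
        state = toy_block(state ^ b, key)  # needs the previous result
        out.append(state)
    return out


def cbc_with_reset(blocks, key, iv):
    """Same loop with the state cleared first, modelling pxor xmm1,xmm1."""
    out = []
    for b in blocks:
        state = 0                          # chaining (and the IV) is discarded
        state = toy_block(state ^ b, key)  # depends on b only
        out.append(state)
    return out
```

With three identical input blocks, cbc_chained gives three different outputs, while cbc_with_reset gives the same output three times and ignores the IV. The CTR loop is fast for the same reason: each counter block is encrypted independently, and the only thing carried between iterations is the short counter increment.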

aw27

This is all done in MASM here:
http://masm32.com/board/index.php?topic=7395.msg80831#msg80831

It should be faster than the BASM you presented.

nidud

#4
deleted