m64lib\xmmcopya bug?

sinsi · July 17, 2019, 07:41:27 PM

Runs into a problem if the byte count is 16 or less (the branch is jle, so 16 itself is included). If the first jle @F is taken, RAX is never initialised, but the final byte-copy loop uses it as its counter.



xmmcopya proc

  ; *********************
  ;  aligned memory copy
  ; *********************

  ; rcx = source address
  ; rdx = destination address
  ; r8  = byte count

    mov r11, r8
    shr r11, 4                  ; div by 16 for loop count
    xor r10, r10                ; zero r10 to use as index

    cmp r8, 16
    jle @F


  lpst:
    movdqa xmm0, [rcx+r10]
    movntdq [rdx+r10], xmm0
    add r10, 16
    sub r11, 1
    jnz lpst

    mov rax, r8                 ; calculate remainder if any
    and rax, 15
    test rax, rax
    jnz @F
    ret

  @@:
    mov r9b, [rcx+r10]          ; copy any remainder
    mov [rdx+r10], r9b
    add r10, 1
    sub rax, 1
    jnz @B

    ret

xmmcopya endp
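
A minimal sketch of a corrected control flow, written in C rather than MASM. It assumes SSE2 intrinsics as stand-ins for the two mnemonics: _mm_load_si128 for movdqa and _mm_stream_si128 for movntdq. The fix is to compute the remainder before any branch and to let the block loop run zero times when the count is under 16, so the tail counter is always valid. The name xmmcopya_fixed, the memcpy fallback for non-SSE2 builds and the trailing _mm_sfence are illustrative additions, not part of m64lib.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* _mm_load_si128, _mm_stream_si128, _mm_sfence */
#endif

/* Hypothetical C rendering of xmmcopya with the tail counter fixed.
   src and dst must be 16-byte aligned whenever count >= 16,
   as movdqa/movntdq require. */
static void xmmcopya_fixed(const void *src, void *dst, size_t count)
{
    const unsigned char *s = (const unsigned char *)src;
    unsigned char *d = (unsigned char *)dst;
    size_t blocks = count >> 4;   /* r11: number of 16-byte blocks */
    size_t rem = count & 15;      /* rax: remainder, set BEFORE any branch */
    size_t i = 0;                 /* r10: running index */

    if (blocks != 0) {
        assert(((uintptr_t)s & 15) == 0 && ((uintptr_t)d & 15) == 0);
    }

    /* Block loop runs zero times when count < 16, so no jle is needed. */
    for (; blocks != 0; --blocks, i += 16) {
#if defined(__SSE2__)
        __m128i v = _mm_load_si128((const __m128i *)(s + i));  /* movdqa  */
        _mm_stream_si128((__m128i *)(d + i), v);               /* movntdq */
#else
        memcpy(d + i, s + i, 16);  /* portable stand-in on non-SSE2 targets */
#endif
    }
#if defined(__SSE2__)
    _mm_sfence();  /* order the non-temporal stores before later stores */
#endif

    /* Byte tail tests the counter first, so rem == 0 copies nothing. */
    for (; rem != 0; --rem, ++i) {
        d[i] = s[i];
    }
}
```

With this ordering a 5-byte call skips the block loop and copies exactly 5 bytes, a 16-byte call does one block and no tail, and a 0-byte call touches nothing.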

hutch-- · July 17, 2019, 10:25:24 PM

I have only just pulled my head out of a linear parser; probably the best way is to test the length and, if it's under 16 bytes, branch to a simple byte copy. I did not envisage it being used for very small byte counts.

hutch-- · July 18, 2019, 03:46:05 PM

If you look at the design of the algorithm, the particular choice of mnemonics (MOVDQA and MOVNTDQ) requires 16-byte aligned memory and, as per the Intel manual, will crash without correct alignment, so it is not a choice for copying under 16 bytes. An algorithm of this type is meant for block streaming of aligned memory, not for small byte counts. On small amounts it's hard to beat REP MOVSB.
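
The rule in the replies above (use the movdqa/movntdq path only for aligned blocks of at least 16 bytes, otherwise fall back to a simple copy) can be sketched as a small dispatcher. This is a sketch under stated assumptions: is_aligned16 and copy_dispatch are hypothetical names, the block path uses memcpy as a portable stand-in for the XMM instructions, and the plain byte loop stands in for REP MOVSB, which C cannot express without inline assembly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: true when p sits on a 16-byte boundary,
   the condition movdqa/movntdq impose on their memory operands. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Hypothetical dispatcher: take the block path only when it is legal
   (both pointers aligned) and worthwhile (at least one full block).
   Returns 1 if the block path was used, 0 if the byte copy was used. */
static int copy_dispatch(const void *src, void *dst, size_t count)
{
    const unsigned char *s = (const unsigned char *)src;
    unsigned char *d = (unsigned char *)dst;

    if (count >= 16 && is_aligned16(s) && is_aligned16(d)) {
        size_t i = 0;
        for (size_t blocks = count >> 4; blocks != 0; --blocks, i += 16)
            memcpy(d + i, s + i, 16);  /* stand-in for movdqa/movntdq */
        for (; i < count; ++i)
            d[i] = s[i];               /* remainder bytes */
        return 1;
    }
    for (size_t i = 0; i < count; ++i)
        d[i] = s[i];                   /* small or misaligned: byte copy */
    return 0;
}
```

A 40-byte copy between aligned buffers takes the block path, while a 7-byte copy, or any copy whose source starts one byte past a 16-byte boundary, falls back to the byte loop instead of faulting.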

sinsi · July 18, 2019, 04:40:21 PM

I realised later that it wasn't a general-purpose routine, but some might see "xmm" in the name and assume it's super-fast or something.

The MASM Forum