m64lib\xmmcopya bug?

Started by sinsi, July 17, 2019, 07:41:27 PM

sinsi

This runs into a problem if the byte count is 16 or less. If the first jle @F is taken, RAX is never initialised, but the final copy loop uses it as a counter.

xmmcopya proc

  ; *********************
  ;  aligned memory copy
  ; *********************

  ; rcx = source address
  ; rdx = destination address
  ; r8  = byte count

    mov r11, r8
    shr r11, 4                  ; div by 16 for loop count
    xor r10, r10                ; zero r10 to use as index

    cmp r8, 16
    jle @F


  lpst:
    movdqa xmm0, [rcx+r10]
    movntdq [rdx+r10], xmm0
    add r10, 16
    sub r11, 1
    jnz lpst

    mov rax, r8                 ; calculate remainder if any
    and rax, 15
    test rax, rax
    jnz @F
    ret

  @@:
    mov r9b, [rcx+r10]          ; copy any remainder
    mov [rdx+r10], r9b
    add r10, 1
    sub rax, 1
    jnz @B

    ret

xmmcopya endp


hutch--

I have only just pulled my head out of a linear parser. Probably the best way is to test the length and, if it's under 16 bytes, branch to a simple byte copy. I did not envisage it being used for very small byte counts.
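A minimal sketch of that approach, keeping the register interface of the posted routine. The name xmmcopya_fix and the labels tailcpy and quit are hypothetical, and this is not the library's actual fix. The remainder is calculated before any branch so RAX is always initialised, and a zero remainder (including a zero byte count) skips the byte loop altogether.

```
xmmcopya_fix proc               ; hypothetical name, not part of m64lib

  ; rcx = source address (16 byte aligned)
  ; rdx = destination address (16 byte aligned)
  ; r8  = byte count

    xor r10, r10                ; zero r10 to use as index
    mov rax, r8
    and rax, 15                 ; remainder, set before any branch

    mov r11, r8
    shr r11, 4                  ; div by 16 for loop count
    jz tailcpy                  ; under 16 bytes, skip the block loop

  lpst:
    movdqa xmm0, [rcx+r10]
    movntdq [rdx+r10], xmm0
    add r10, 16
    sub r11, 1
    jnz lpst

  tailcpy:
    test rax, rax
    jz quit                     ; no remainder, or a zero byte count

  @@:
    mov r9b, [rcx+r10]          ; copy any remainder
    mov [rdx+r10], r9b
    add r10, 1
    sub rax, 1
    jnz @B

  quit:
    ret

xmmcopya_fix endp
```

With a byte count of exactly 16 the block loop runs once and the remainder test exits straight away, so that case no longer falls into the byte loop.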

hutch--

If you look at the design of the algorithm, it requires 16-byte aligned memory for the particular choice of mnemonics and will crash without correct alignment, as per the Intel manual, so it is not a choice for memory copies under 16 bytes in length. An algorithm of this type would be used for block streaming of aligned memory, not for small byte counts. On small amounts it's hard to beat REP MOVSB.
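For comparison, a rough sketch of a REP MOVSB copy with the same rcx/rdx/r8 interface. The name smallcopy is hypothetical. It assumes the Win64 calling convention, where RSI and RDI are non-volatile and the direction flag is clear on entry, and it has no alignment requirement.

```
smallcopy proc                  ; hypothetical name, not part of m64lib

  ; rcx = source address
  ; rdx = destination address
  ; r8  = byte count

    mov r10, rsi                ; rsi and rdi are non-volatile in Win64,
    mov r11, rdi                ; keep them in volatile scratch registers

    mov rsi, rcx                ; source
    mov rdi, rdx                ; destination
    mov rcx, r8                 ; byte count
    rep movsb                   ; a zero count copies nothing

    mov rdi, r11                ; restore non-volatile registers
    mov rsi, r10
    ret

smallcopy endp
```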

sinsi

I realised later that it wasn't a general purpose routine, but some might see xmm and assume it's super-fast or something :biggrin: