Addressing an unusual problem with aligned ymm regs.

Started by hutch--, April 26, 2021, 09:24:40 PM

hutch--


I have long had stack frame adjustments available, and while setting up a YMM test piece I wanted LOCAL YMM-size variables to store YMM registers in. The unaligned mnemonic vmovdqu works fine, but the aligned version vmovdqa crashes every time.

The macro below tests the stack alignment directly after the LOCALS. The alignment is reported as correct, but vmovdqa crashes anyway.

  ; -----------------------------------------------------------------------

    CheckStackAlign MACRO anum
      LOCAL pbuf, pout, obuf, buff, fnum, xvar
    .data?
      buff db 64 dup (?)
      obuf db 16 dup (?)
      fnum REAL8 ?
      xvar REAL8 ?
    .data
      pbuf dq ?
      pout dq ?
    .code
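      mov pbuf, ptr$(buff)                        ;; pbuf -> text buffer for the quotient
      mov pout, ptr$(obuf)                        ;; pout -> text buffer for anum

      loadsd xmm0, anum                           ;; load anum into xmm0
      movsd fnum, xmm0                            ;; store it as a REAL8
      rcall fptoa,fnum,pout                       ;; convert anum to text at pout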

      cvtsi2sd xmm0, rsp                          ;; load stack pointer into xmm0
      loadsd xmm1, anum                           ;; load anum into xmm1
      divsd xmm0, xmm1                            ;; divide rsp value by anum

      movsd xvar, xmm0
      rcall fptoa,xvar,pbuf

      conout "Stack pointer RSP divided by ",pout," = ",pbuf,lf, \
             "If number has no fraction, it is aligned by ",pout,lf,lf
    ENDM

  ; -----------------------------------------------------------------------
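One thing the macro does not show: it reports the alignment of RSP, while vmovdqa faults on the address of its memory operand, which for a YMM store must be a multiple of 32. A LOCAL sits at some offset from the frame, so an aligned RSP does not by itself guarantee an aligned local. Below is a minimal sketch of testing the local's own address with integer instructions. The names yvar, not_aligned and done are hypothetical and not from the original post, and the fragment assumes it sits inside a procedure that declares yvar as a 32-byte LOCAL.

```asm
    ; Sketch only. Assumes a hypothetical local such as:  LOCAL yvar :YMMWORD
    lea  rax, yvar                      ; rax = address of the local itself, not RSP
    test rax, 31                        ; low 5 bits are zero only on a 32-byte boundary
    jnz  not_aligned                    ; non-zero: an aligned store here would fault

    vmovdqa YMMWORD PTR [rax], ymm0     ; aligned store, address is a multiple of 32
    jmp  done

  not_aligned:
    vmovdqu YMMWORD PTR [rax], ymm0     ; unaligned store, valid at any address

  done:
```

If this test fails while the RSP test passes, the offset of the local from RSP is not a multiple of 32, which would be consistent with the crash described above.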

nidud

#1
deleted

TouEnMasm

I don't see where vmovdqa or vmovdqu is used?
For everyone:
Quote
MOVDQU,VMOVDQU8/16/32/64 - Move Unaligned Packed Integer Values
   Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b otherwise instructions will #UD.
   EVEX encoded versions:
   Moves 128, 256 or 512 bits of packed byte/word/doubleword/quadword integer values from the source operand
   (the second operand) to the destination operand (first operand). This instruction can be used to load a vector
   register from a memory location, to store the contents of a vector register into a memory location, or to move data
   between two vector registers.
   The destination operand is updated at 8-bit (VMOVDQU8), 16-bit (VMOVDQU16), 32-bit (VMOVDQU32), or 64-bit
   (VMOVDQU64) granularity according to the writemask.
   VEX.256 encoded version:
   Moves 256 bits of packed integer values from the source operand (second operand) to the destination operand
   (first operand). This instruction can be used to load a YMM register from a 256-bit memory location, to store the
   contents of a YMM register into a 256-bit memory location, or to move data between two YMM registers.
   Bits (MAXVL-1:256) of the destination register are zeroed.
   128-bit versions:
   Moves 128 bits of packed integer values from the source operand (second operand) to the destination operand
   (first operand). This instruction can be used to load an XMM register from a 128-bit memory location, to store the
   contents of an XMM register into a 128-bit memory location, or to move data between two XMM registers.
   128-bit Legacy SSE version: Bits (MAXVL-1:128) of the corresponding destination register remain unchanged.
   When the source or destination operand is a memory operand, the operand may be unaligned to any alignment
   without causing a general-protection exception (#GP) to be generated.
   VEX.128 encoded version: Bits (MAXVL-1:128) of the destination register are zeroed.


Quote
2.3.5 The VEX Prefix
The VEX prefix is encoded in either the two-byte form (the first byte must be C5H) or in the three-byte form (the
first byte must be C4H). The two-byte VEX is used mainly for 128-bit, scalar, and the most common 256-bit AVX
instructions; while the three-byte VEX provides a compact replacement of REX and 3-byte opcode instructions
(including AVX and FMA instructions). Beyond the first byte of the VEX prefix, it consists of a number of bit fields
providing specific capability, they are shown in Figure 2-9.
The bit fields of the VEX prefix can be summarized by its functional purposes:
•   Non-destructive source register encoding (applicable to three and four operand syntax): This is the first source
    operand in the instruction syntax. It is represented by the notation, VEX.vvvv. This field is encoded using 1's
    complement form (inverted form), i.e. XMM0/YMM0/R0 is encoded as 1111B, XMM15/YMM15/R15 is encoded
    as 0000B.
•   Vector length encoding: This 1-bit field represented by the notation VEX.L. L= 0 means vector length is 128 bits
    wide, L=1 means 256 bit vector. The value of this field is written as VEX.128 or VEX.256 in this document to
    distinguish encoded values of other VEX bit fields.
•   REX prefix functionality: Full REX prefix functionality is provided in the three-byte form of VEX prefix. However
    the VEX bit fields providing REX functionality are encoded using 1's complement form, i.e. XMM0/YMM0/R0 is
    encoded as 1111B, XMM15/YMM15/R15 is encoded as 0000B.
    — Two-byte form of the VEX prefix only provides the equivalent functionality of REX.R, using 1's complement
      encoding. This is represented as VEX.R.
    — Three-byte form of the VEX prefix provides REX.R, REX.X, REX.B functionality using 1's complement
      encoding and three dedicated bit fields represented as VEX.R, VEX.X, VEX.B.
    — Three-byte form of the VEX prefix provides the functionality of REX.W only to specific instructions that need
      to override default 32-bit operand size for a general purpose register to 64-bit size in 64-bit mode. For
      those applicable instructions, VEX.W field provides the same functionality as REX.W. VEX.W field can
      provide completely different functionality for other instructions.
    Consequently, the use of REX prefix with VEX encoded instructions is not allowed. However, the intent of the
    REX prefix for expanding register set is reserved for future instruction set extensions using VEX prefix
    encoding format.
•   Compaction of SIMD prefix: Legacy SSE instructions effectively use SIMD prefixes (66H, F2H, F3H) as an
    opcode extension field. VEX prefix encoding allows the functional capability of such legacy SSE instructions
    (operating on XMM registers, bits 255:128 of corresponding YMM unmodified) to be encoded using the VEX.pp
    field without the presence of any SIMD prefix. The VEX-encoded 128-bit instruction will zero-out bits 255:128
    of the destination register. VEX-encoded instruction may have 128 bit vector length or 256 bits length.
•   Compaction of two-byte and three-byte opcode: More recently introduced legacy SSE instructions employ two
    and three-byte opcode. The one or two leading bytes are: 0FH, and 0FH 3AH/0FH 38H. The one-byte escape
    (0FH) and two-byte escape (0FH 3AH, 0FH 38H) can also be interpreted as an opcode extension field. The
    VEX.mmmmm field provides compaction to allow many legacy instruction to be encoded without the constant
    byte sequence, 0FH, 0FH 3AH, 0FH 38H. These VEX-encoded instruction may have 128 bit vector length or 256
    bits length.
The VEX prefix is required to be the last prefix and immediately precedes the opcode bytes. It must follow any other
prefixes. If VEX prefix is present a REX prefix is not supported.
The 3-byte VEX leaves room for future expansion with 3 reserved bits. REX and the 66h/F2h/F3h prefixes are
reclaimed for future use.
VEX prefix has a two-byte form and a three byte form. If an instruction syntax can be encoded using the two-byte
form, it can also be encoded using the three byte form of VEX. The latter increases the length of the instruction by
one byte. This may be helpful in some situations for code alignment.
The VEX prefix supports 256-bit versions of floating-point SSE, SSE2, SSE3, and SSE4 instructions. Note, certain
new instruction functionality can only be encoded with the VEX prefix.
The VEX prefix will #UD on any instruction containing MMX register sources or destinations.

hutch--

Yves,

The posting was a macro to test stack alignment. I have had the Intel manuals for many years so I miss your point.

nidud,

I understand how to align memory; it's the stack alignment that I had to test. It comes out perfect every time, but the aligned version vmovdqa still crashes.

nidud

#4
deleted

hutch--

I solved the problem with a different stack frame design.
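The thread does not show the new stack frame design, so the following is not it. It is only a generic sketch of one common way to get a 32-byte aligned slot on the stack: keep the original RSP in RBP and round RSP down with a mask. It assumes a hand-written sequence with no automatic prologue or epilogue generated around it, it makes no calls (so no shadow space is reserved), and it carries no unwind data.

```asm
    ; Sketch only: manual frame that forces a 32-byte aligned scratch slot.
    push rbp                            ; save the caller's frame pointer
    mov  rbp, rsp                       ; rbp keeps the original, possibly unaligned RSP
    sub  rsp, 32                        ; reserve one 32-byte slot
    and  rsp, -32                       ; round RSP down to a multiple of 32

    vmovdqa YMMWORD PTR [rsp], ymm0     ; aligned store into the slot at [rsp]
    vmovdqa ymm1, YMMWORD PTR [rsp]     ; aligned load back from the same slot

    mov  rsp, rbp                       ; discard the aligned area
    pop  rbp                            ; restore the caller's frame pointer
```

Because the slot is addressed from the rounded RSP rather than from RBP, its alignment does not depend on how the caller left the stack. The Windows x64 calling convention only guarantees 16-byte stack alignment at a call, which is why the extra masking step is needed for YMM-size data.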