Having trouble figuring out how to create a mask for an XMM register

Started by Rav, June 14, 2022, 04:14:23 AM

Rav

I have an XMM register consisting of 16 bytes, each byte being the value FF, like so:  FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

Let's say EAX contains a value from 1 to 15, which represents the number of high order bytes of the XMM register I want to zero out.  So for example:

If EAX was 1, the resulting XMM register would be:   00FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
If EAX was 2, the resulting XMM register would be:   0000FFFFFFFFFFFFFFFFFFFFFFFFFFFF
...
...
If EAX was 15, the resulting XMM register would be: 000000000000000000000000000000FF

This has to be done at runtime, so I can't use immediate operands (I had thought of using PSLLDQ and PSRLDQ to shift, but those use imm8).

You may be guessing that what I'm trying to do is create a mask which I will subsequently PAND with another XMM register, and that is exactly what I am doing.  If I could zero out EAX high-order bytes of an XMM register directly without even having to use a mask, that would be even better.

I have been trying to figure out what instruction(s) will do this for me, and keep being stumped.  This must run in 32-bit mode; I am able to use SSE 4.2 instructions.

Thanks for any ideas.  / Rav

jj2007

Self-modifying code would be an option. Thus, you could use the imm8.

daydreamer

self modifying SSE code is impossible with modern security settings
Rav, check the pshufb byte-shuffle instruction: besides emulating a shift, it also gives you a way to zero bytes (any index byte with its high bit set produces a zero byte in the result)

jj2007

Quote from: daydreamer on June 14, 2022, 05:36:03 AM
self modifying SSE code is impossible with modern security settings

Sure.

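include \masm32\MasmBasic\MasmBasic.inc
MyO OWORD 11223344556677889900AABBCCDDEEFFh
  Init
  movups xmm1, MyO
  deb 4, "start", x:xmm1
  mov ecx, 5
  .Repeat
    mov ch, 0C3h        ; ch = opcode of retn
    push ecx            ; stack bytes: cl (the imm8), C3 (retn), 00, 00
    mov ch, 0           ; restore ecx to the plain loop counter
    push 0F9730F66h     ; stack bytes: 66 0F 73 F9 = pslldq xmm1, imm8
    call esp            ; run "pslldq xmm1, <value of cl>" + retn from the stack
    add esp, 8          ; discard the two pushed dwords
    deb 4, Str$("#%i", ecx), x:xmm1
    dec ecx
  .Until Zero?
  MsgBox 0, "With a very old OS like Windows 7-64, that works just fine", "Hi", MB_OK
EndOfCode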


Output:
start   x:xmm1          11223344 55667788 9900AABB CCDDEEFF
#5      x:xmm1          66778899 00AABBCC DDEEFF00 00000000
#4      x:xmm1          00AABBCC DDEEFF00 00000000 00000000
#3      x:xmm1          CCDDEEFF 00000000 00000000 00000000
#2      x:xmm1          EEFF0000 00000000 00000000 00000000
#1      x:xmm1          FF000000 00000000 00000000 00000000

Rav

Thanks, it was a great idea, but yeah, I got an access violation (I did verify I was overwriting the correct bytes).  Unless there's something I'm doing wrong there, I'll look into pshufb (thanks daydreamer).  Here is the code I tried:

    ; On entry, BL contains the number of bytes to shift.

    call GetEIP             ; Get EIP (can't access it directly).
    jmp AfterGetEIP     ; Jump over GetEIP code.

    ; Return value of EIP in ESI:
    GetEIP:
    mov esi,[esp]
    ret

    AfterGetEIP:
    add esi,6+ShiftXMM-GetEIP   ; Offset to where the first imm8 value is.
    mov [esi],bl                ; Overwrite first imm8.
    mov [esi+5],bl              ; Overwrite second imm8.

    ShiftXMM:
    PSLLDQ xmm4,91H ; Placeholder imm8 will be overwritten at runtime by self-modifying code.
    PSRLDQ xmm4,92H ; Same.

jj2007

Quote from: Rav on June 14, 2022, 06:27:30 AM
Thanks, it was a great idea, but yeah, I got an access violation

Well.... zero downloads of my code means you obviously haven't tested it. My code works like a charm, btw also on Windows 10, so I assume your version has a little problem. Happy bug chasing :tongue:

Rav

Quote from: jj2007 on June 14, 2022, 06:32:40 AM
Quote from: Rav on June 14, 2022, 06:27:30 AM
Thanks, it was a great idea, but yeah, I got an access violation

Well.... zero downloads of my code means you obviously haven't tested it. My code works like a charm, btw also on Windows 10, so I assume your version has a little problem. Happy bug chasing :tongue:

Sorry, I didn't see the download file name (it's displayed in a very small font).  I just downloaded it and tried it.  It does appear to work, but as a relative assembly newbie I don't understand the code.

I looked into using pshufb, but it would apparently require a significant amount of setup before executing it, plus, if I understand it correctly, it would do excess work indexing (and yet not actually moving) the bytes I'm NOT zeroing.  I don't know if that description made sense.

At any rate, I'm going to try a different approach: a table-driven PAND mask.  There are only 15 possible masks that I need (00FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF, 0000FFFFFFFFFFFFFFFFFFFFFFFFFFFF ... 0000000000000000000000000000FFFF, and 000000000000000000000000000000FF), each 128 bits (16 bytes) long, so a table of 240 bytes doesn't concern me.  I'll just offset to one of the masks in memory and PAND with it.  At least that's the theory.  And it should be very fast.
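
For anyone who wants to try the table-driven idea outside assembly, here is a minimal sketch in C with SSE2 intrinsics. It is only an illustration, not Rav's actual code: the names mask_table, init_mask_table and zero_high_bytes_table are made up for this example, and the table is assumed to have 16 entries (n = 0 to 15) instead of 15 so that n can index it directly.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: _mm_loadu_si128, _mm_and_si128 */
#include <stdint.h>

/* Hypothetical table: mask_table[n] keeps the low 16-n bytes and clears the
   high n bytes. Byte 0 is the lowest-order byte of the XMM register. */
static uint8_t mask_table[16][16];

/* Fill the table once at startup (in assembly this would be static data). */
static void init_mask_table(void)
{
    for (unsigned n = 0; n < 16; n++)
        for (unsigned i = 0; i < 16; i++)
            mask_table[n][i] = (i < 16 - n) ? 0xFF : 0x00;
}

/* Zero the n high-order bytes of v (n = 0..15): one load plus one PAND. */
static __m128i zero_high_bytes_table(__m128i v, unsigned n)
{
    assert(n < 16);
    __m128i mask = _mm_loadu_si128((const __m128i *)mask_table[n]);
    return _mm_and_si128(v, mask);
}
```

Call init_mask_table() once, then zero_high_bytes_table(v, n) for each value. In assembly the lookup is just the count scaled by 16 and added to the table address. Keep in mind that PAND with a memory operand requires the table to be 16-byte aligned; otherwise load the mask with MOVDQU first, as the sketch does.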

jj2007

Quote from: Rav on June 14, 2022, 07:04:05 AMI'm going to try a different approach: table-driving the PAND mask.  There are only 15 possible masks that I need (00FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF, 0000FFFFFFFFFFFFFFFFFFFFFFFFFFFF ... 0000000000000000000000000000FFFF, and 000000000000000000000000000000FF), each 128 bits (16 bytes) long, so a table of 240 bytes doesn't concern me.  I'll just offset to one of the masks in memory and PAND with it.  At least that's the theory.  And it should be very fast.

Yep, that sounds like a good idea :thumbsup:

Quote from: Rav on June 14, 2022, 07:04:05 AMI don't understand the code

Let's have a look under the hood (the int 3 is to stop the debugger at this precise point):

  mov ecx, 5
  int 3
  .Repeat
    mov ch, 0C3h
    push ecx
    mov ch, 0
    push 0F9730F66h
    call esp ; pslldq xmm1, imm8 (imm8 = current value of cl)
    add esp, 8
    dec ecx
  .Until Zero?


Address   Hex dump          Command
004011C2  |.  B9 05000000   mov ecx, 5
004011C7  |.  CC            int3
004011C8  |>  B5 C3         /mov ch, 0C3
004011CA  |.  51            |push ecx
004011CB  |.  B5 00         |mov ch, 0
004011CD  |.  68 660F73F9   |push F9730F66
004011D2  |.  FFD4          |call esp  <<<<<<<<<<<<<<<
004011D4  |.  83C4 08       |add esp, 8
004011D7  |.  49            |dec ecx
004011D8  |.^ 75 EE         \jnz short 004011C8


When calling esp, the CPU finds the following at esp (shown here for the iteration where ecx = 4):
0018FF84    660F73F9 04     pslldq xmm1, 4
0018FF89    C3              retn
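
To see why those exact bytes end up at esp, here is a small C sketch that rebuilds the stack contents from the two pushed dwords. It is an illustration only: the helper name stack_stub_bytes is hypothetical, and it assumes the upper 16 bits of ecx are zero, as they are in the loop above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: fills out[0..7] with the bytes found at [esp]..[esp+7]
   right before "call esp", lowest address first. Assumes the upper 16 bits
   of ecx are zero. */
static void stack_stub_bytes(uint8_t shift, uint8_t out[8])
{
    uint32_t opcode_dword = 0xF9730F66u;                  /* push 0F9730F66h -> [esp]   */
    uint32_t ecx_dword = (uint32_t)shift | (0xC3u << 8);  /* push ecx        -> [esp+4] */
    for (unsigned i = 0; i < 4; i++) {
        /* x86 is little-endian: the low byte of each dword has the lowest address. */
        out[i]     = (uint8_t)(opcode_dword >> (8 * i));
        out[4 + i] = (uint8_t)(ecx_dword >> (8 * i));
    }
}
```

For shift = 4 this yields 66 0F 73 F9 04 C3 00 00: the first five bytes are pslldq xmm1, 4, the sixth is retn, and the two trailing zero bytes are never executed.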

Rav

jj2007, I THINK I understand, but I'm not sure.  You're actually pushing the instruction code (and the imm8 value) for pslldq onto the stack, then the call jumps to it and the CPU executes it there?  In other words, by calling esp, you're calling into the stack area rather than the code area?  Is that right?  And is that NOT an access violation because it's not modifying anything in the .code segment, but is modifying the stack, which is in the .data segment?  If that's right, is it perfectly permissible to execute code that resides in the .data segment rather than the .code segment?  I think I may still be confused.

jj2007

You do understand - that's exactly what happens :tongue:


Rav

Quote from: jj2007 on June 14, 2022, 08:34:04 AM
You do understand - that's exactly what happens :tongue:

Thanks, and thanks for taking the time to explain.  / Rav

InfiniteLoop

Let rax = number of high-order bytes to zero (0 to 16), xmm0 = input vector (in 32-bit code, use eax instead of rax)

Attempt 1:
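    movdqu xmm1, xmmword ptr [MaskA + rax]   ; unaligned 16-byte load from a sliding window
    pand xmm0, xmm1                          ; keep the low 16-rax bytes, zero the rest
    ret
MaskA QWORD 0FFFFFFFFFFFFFFFFh, 0FFFFFFFFFFFFFFFFh, 0, 0   ; 16 bytes of FF followed by 16 bytes of 00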


I can't think of a way to create a mask in fewer than 5 cycles.
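
The sliding-window load above can be sketched in C with SSE2 intrinsics as follows. This is a minimal illustration under assumptions: the names mask_window and zero_high_bytes_window are hypothetical, and the window is 32 bytes so that every offset from 0 to 16 stays in bounds.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: _mm_loadu_si128, _mm_and_si128 */
#include <stdint.h>

/* Hypothetical 32-byte window: 16 bytes of FF followed by 16 bytes of 00
   (the remaining initializers default to zero). */
static const uint8_t mask_window[32] = {
    0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF,
    0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF
};

/* Zero the n high-order bytes of v (n = 0..16). Loading 16 bytes at offset n
   picks up 16-n bytes of FF (low-order) and n bytes of 00 (high-order). */
static __m128i zero_high_bytes_window(__m128i v, unsigned n)
{
    assert(n <= 16);
    __m128i mask = _mm_loadu_si128((const __m128i *)(mask_window + n));
    return _mm_and_si128(v, mask);
}
```

Compared with a 240-byte table of separate masks, this needs only 32 bytes of data, at the price of an unaligned load for most values of n.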