AVX2 __m256i (ymmword) variable byte shift.

InfiniteLoop · March 02, 2021, 07:54:37 PM

Quote from: jj2007 on February 22, 2021, 10:53:06 AM
No. See reply #9.

I tested it. Setting LAA to disabled causes any allocation >= 2Gb to fail.

In the process I discovered 2 bugs in the way c++ works.
(1) For a pointer symbol P, which is identical to &P[0], the c++ doe something stupid and crashes unless the latter is used.
(2) 64-bit integers used for addresses and bytes. Absolute nightmare. 1024*1024*1024*2 equals 2 billion but I must be wrong apparently because the compiler says it equals 18 quadrillion and tries to allocate those bytes.

This is the fastest method to shift right by edx bytes.

Code Select


;=====================================================================
; Shift Variable Right x bytes (RCX == input ymmword. RDX == count bytes)
;=====================================================================
ASM_VarShift_Right proc
vmovdqu ymm0, ymmword ptr [rcx]			;extract parameter
vmovd xmm2, edx
mov eax,15
vpbroadcastb ymm2, xmm2
vpaddb ymm1, ymm2, ymmword ptr Shuffle_Order
vmovd xmm3, eax
vpbroadcastb ymm3, xmm3
vpcmpgtb ymm4, ymm1, ymm3
vpand ymm2, ymm1, ymm3
vpshufb ymm0, ymm0, ymm2
vextracti128 xmm3, ymm0, 1
vpblendvb ymm0, ymm0, ymm3, ymm4
ret
Shuffle_Order BYTE 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
ASM_VarShift_Right endp

jj2007 · March 02, 2021, 09:54:27 PM

Quote from: InfiniteLoop on February 22, 2021, 09:40:59 AMLargeAddressAware does nothing anymore. Its a 32-bit thing.

No, it's a 64-bit thing, too:

LAA ON: The address is at 40003018h
LAA OFF: The address is at 403018h

Code Select

include \Masm32\MasmBasic\Res\JBasic.inc	; OPT_64 1	; for 32 bit, 1 for 64 bit assembly
_number dq 1234567890123456789
Init
  ifidn @Environ(oDebugL), </Largeaddressaware:NO>
	Print "LAA OFF: "
  else
	Print "LAA ON: "
  endif
  Inkey Str$("The address is at %xh\n", offset _number)
EndOfCode

OPT_DebugL /Largeaddressaware:NO  ; disable to get LAA ON

Ralphy · February 15, 2022, 10:07:44 AM

Not sure if this quote from Mr. Hyde's book helps but..

"Large Address Unaware Applications

One advantage of 64-bit addresses is that they can access a frightfully large
amount of memory (something like 8TB under Windows). By default, the
Microsoft linker (when it links together the C++ and assembly language
code) sets a flag named LARGEADDRESSAWARE to true (yes). This makes it possible
for your programs to access a huge amount of memory. However, there is a
price to be paid for operating in LARGEADDRESSAWARE mode: the const component
of the [reg64 + const] addressing mode is limited to 32 bits and cannot
span the entire address space.

Because of instruction-encoding limitations, the const value is limited
to a signed value in the range ±2GB. This is probably far more than enough
when the register contains a 64-bit base address and you want to access
a memory location at a fixed offset (less than ±2GB) around that base
address. A typical way you would use this addressing mode is as follows:

lea rcx, someStructure
mov al, [rcx+fieldOffset]

Prior to the introduction of 64-bit addresses, the const offset appearing
in the (32-bit) indirect-plus-offset addressing mode could span the entire
(32-bit) address space. So if you had an array declaration such as

.data
buf byte 256 dup (?)

you could access elements of this array by using the following addressing
mode form:

mov al, buf[ebx] ; EBX was used on 32-bit processors

If you were to attempt to assemble the instruction mov al, buf[rbx] in
a 64-bit program (or any other addressing mode involving buf other than
PC-relative), MASM would assemble the code properly, but the linker would
report an error:
error LNK2017: 'ADDR32' relocation to 'buf' invalid without /LARGEADDRESSAWARE:NO

The linker is complaining that in an address space exceeding 32 bits,
it is impossible to encode the offset to the buf buffer because the machine
instruction opcodes provide only a 32-bit offset to hold the address of buf.

However, if we were to artificially limit the amount of memory that our
application uses to 2GB, then MASM can encode the 32-bit offset to buf into
the machine instruction. As long as we kept our promise and never used any
more memory than 2GB, several new variations on the indirect-plus-offset
and scaled-indexed addressing modes become possible.

To turn off the large address–aware flag, you need to add an extra command
line option to the ml64 command. This is easily done in the build.bat
file; let's create a new build.bat file and call it sbuild.bat. This file will have
the following lines:

echo off
ml64 /nologo /c /Zi /Cp %1.asm
cl /nologo /O2 /Zi /utf-8 /EHa /Fe%1.exe c.cpp %1.obj /link /largeaddressaware:no

This set of commands tells MASM to pass a
command to the linker that turns off the large address–aware file. MASM,
MSVC, and the Microsoft linker will construct an executable file that
requires only 32-bit addresses (ignoring the 32 HO bits in the 64-bit registers
appearing in addressing modes).

Once you've disabled LARGEADDRESSAWARE, several new variants of the
indirect-plus-offset and scaled-indexed addressing modes become available
to your programs:

variable[reg64]
variable[reg64 + const]
variable[reg64 - const]
variable[reg64 * scale]
variable[reg64 * scale + const]
variable[reg64 * scale - const]
variable[reg64 + reg_not_RSP64 * scale]
variable[reg64 + reg_not_RSP64 * scale + const]
variable[reg64 + reg_not_RSP64 * scale - const]

where variable is the name of an object you've declared in your source file
by using directives like byte, word, dword, and so on; const is a (maximum
32-bit) constant expression; and scale is 1, 2, 4, or 8. These addressing mode
forms use the address of variable as the base address and add in the current
value of the 64-bit registers."

hutch-- · March 14, 2022, 05:39:00 PM

You just need to learn how 64 bit works, There is a special MOV mnemonic that will load large addresses into a 64 bit register. There is no opcode to load a 64 bit sized immediate into a memory operand, you load it into a register first then copy the register into the memory operand.

Turning off the /LARGEADDRESSAWARE is a mistake, you get all of the quirks of Win64 while being restricted to 32 bit memory addressing range.

mikeburr · March 21, 2022, 09:41:00 PM

hutch
can you give an example of how to do this .. [theres plenty of stuff saying set largeaddress to no but nothing on largeaddressware that solves the issue]
eg
.data
message0 " 31char message... ",0
message1 2..31char message" ,0
....
...

....code

mov r15, 3
shl r15,5 ; get 5th message in array
lea rcx, [message0 + r15] ..> fails lnk2017

how can this be done ??
regards mike b

mikeburr · March 21, 2022, 10:14:06 PM

looks to be fairly simple
replace line
lea rcx, [message0 + r15]
by
mov rcx, offset message0
add rcx, r15 ; displacement
regards mikeb .. ps..havent tested yet but doesnt throw a lnk error so i guess it will work ok

hutch-- · March 21, 2022, 10:22:21 PM

Mike,

Using the /LARGEADDRESSAWARE means transferring immediates to a 64 bit register first, then transferring the register to a 64 bit operand.

mov rax, 1234567890123456 ; big number
mov var, rax

RE : AVX, I have done some work there and the majpr factor is data alignment. It will say nasty things to you and crash if you get that wrong. Effectively you align data for AVX to at least the size of the register. Check the actual instructions in the Intel manual.

Gunther · March 22, 2022, 03:20:56 AM

Quote from: hutch-- on March 21, 2022, 10:22:21 PM
RE : AVX, I have done some work there and the majpr factor is data alignment. It will say nasty things to you and crash if you get that wrong. Effectively you align data for AVX to at least the size of the register. Check the actual instructions in the Intel manual.

This is a very important hint. Even better is the alignment to 16 or even 32. It' just a few bytes that are given away by this.

mikeburr · March 23, 2022, 04:00:43 AM

thanks.. yes been aligning to 16 where possible ... turned out using largeaddressware required a few changes that proved a bit longwinded
[ 3 lines of code replacing 1 in the lnk error cases ] but working fine
regards mikeb

johnsa · April 03, 2022, 02:06:43 AM

There really isn't any reason to be using /LARGEADDRESSAWARE:NO for 64bit coding... that really defeats the object.

You want to get as much PIC (RIP relative) as possible.

branches/calls are all rip relative.
lea is inherently rip relative too and is your friend.

Use LEA to load the base address and then use it with another register as your index.
LEA rax,message0
mov rcx,(5*32) ; some offset amount
mov al,[rax+rcx]

variable[idx] type addressing that was common in 32bit asm code is a terrible idea now, on the plus side it means fewer relocations avoiding this.

TimoVJL · April 03, 2022, 02:49:49 AM

PIC coding is important in linux too, so why not accept that in Windows x64 too.

The MASM Forum

News:

AVX2 __m256i (ymmword) variable byte shift.

InfiniteLoop

jj2007

Ralphy

hutch--

mikeburr

mikeburr

hutch--

Gunther

mikeburr

johnsa

TimoVJL