x64 Correct Syntax

Started by Elegant, July 22, 2015, 01:45:54 PM


Elegant

Hi, I have the three lines below from a single x86 procedure. They're relatively straightforward, but I'm not sure my assumption about their x64 counterparts is correct:


;x86, 16 alignment
...
movq xmm0, qword ptr[eax]
movq xmm1, qword ptr[ecx]
...
movdqa xmm4, [esi]
...



;x64
...
movqda xmm0, [rax]
movqda xmm1, [rcx]
...
movdqa xmm4, [rsi]
...


Is that actually correct?


Gunther

Hi Elegant,

what's your problem?
MOVQ copies a quadword from the source operand (second operand) to the destination operand (first operand). The source and destination operands can be MMX technology registers, XMM registers, or 64-bit memory locations.

MOVDQA moves 128 bits of packed integer values from the source operand (second operand) to the destination operand (first operand).

MOVQDA doesn't seem to be a valid instruction.
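
For illustration, a minimal sketch of the difference (the buffer label is just a placeholder):

.data
align 16
buffer db 16 dup (0)            ; 16-byte aligned source

.code
lea    rax, buffer
movq   xmm0, qword ptr [rax]    ; loads 64 bits, upper half of xmm0 is zeroed
movdqa xmm1, xmmword ptr [rax]  ; loads 128 bits, memory must be 16-byte aligned
movdqu xmm2, xmmword ptr [rax]  ; loads 128 bits, no alignment requirement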

By the way, there's not much difference between 32-bit and 64-bit programming when you're using the extended multimedia registers (XMM). Of course, under Win32 you only have access to XMM0 to XMM7; that's a restriction.

Gunther
You have to know the facts before you can distort them.

Elegant

The weird thing is that movdqa xmm4, [esi] is actually working with no issue in this set of code. That's why I can't understand why my x64 version of that line also uses movdqa; you'd think it would use something larger in that case, similar to how movq becomes movdqa for the first two lines. The program is currently producing the desired results as far as I can see, but the concept of how that line with esi is actually correct is eluding me.

Gunther

Hi Elegant,

could you post the entire code of your program? Another way is to make a test bed with the critical code section.

Gunther
You have to know the facts before you can distort them.

rrr314159

Elegant, Gunther's right of course: there's no movqda, and there should be no problem here. To port this code to x64 you only need to change the e's to r's: eax, ecx, esi become rax, rcx, rsi (and even that isn't necessary if you build without the large address aware option).

If "qword ptr" is required for x86 it's also required for x64, no difference (altho movq knows it's moving a quadword so normally it's not used; but maybe your assembler needs it?). For movdqa, it knows it's moving 128 bits so no "xmmword ptr" is needed (altho JWasm used to require it).

The problem is, you're apparently thinking that in moving to x64 those data transfers should double: if x86 moves 64 bits, then x64 should move 128 bits. That's not the case; as Gunther says, SIMD code basically doesn't change when moving from 32-bit to 64-bit. The only exception: it's conceivable that the rest of the program is written such that the larger transfer becomes necessary. I doubt it very much! ... but that's why Gunther suggests posting the rest of the code, to see if something strange like that is happening.

Very likely, just change the e's to r's and you're done.
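
For example, a sketch using your own three lines (assuming the surrounding code stays as it is):

; x86
movq   xmm0, qword ptr [eax]    ; 8 chroma bytes (64 bits)
movq   xmm1, qword ptr [ecx]
movdqa xmm4, [esi]              ; 16 luma bytes (128 bits)

; x64 - same instructions, only the registers change
movq   xmm0, qword ptr [rax]
movq   xmm1, qword ptr [rcx]
movdqa xmm4, [rsi]
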
I am NaN ;)

Elegant

I'll post what I'm using right now (mind you, this is a long function). The x86 version was created from an inline version found in a cpp file, and it works. The goal is to translate that into x64 and tidy it up a bit. It deals with the YUV color space, in case you were wondering. There are three other conversion functions, but they are very similar.

x86:

conv1_YV12_SSE2 proc public srcpY:dword,srcpU:dword,srcpV:dword,src_pitchR:dword,src_pitchY2:dword,src_pitchUV:dword,dstpY:dword,dstpU:dword,dstpV:dword,dst_pitchR:dword,dst_pitchY2:dword,dst_pitchUV:dword,width_:dword,height:dword,loopctr:dword,fact_YU:oword,fact_YV:oword,fact_UU:oword,fact_UV:oword,fact_VV:oword,fact_VU:oword

mov esi, srcpY
mov edi, dstpY
mov eax, srcpU
movd mm0, dstpU
mov ecx, srcpV
mov edx, dstpV
pxor xmm7, xmm7 ; all 0's
movdqa xmm6, oword ptr Q64 ; 64's words
align 16

xloop:
movq xmm0, qword ptr[eax]
movq xmm1, qword ptr[ecx]
punpcklbw xmm0, xmm7 ; unpack to words, 0U0U0U0U
punpcklbw xmm1, xmm7 ; unpack to words, 0V0V0V0V
psubw xmm0, oword ptr Q128 ; adj for 128 chroma offset
psubw xmm1, oword ptr Q128 ; adj for 128 chroma offset
pmullw xmm0, xmm6 ; *64, for rounding later
pmullw xmm1, xmm6 ; *64, for rounding later
movdqa xmm2, xmm0 ; copy so mm0 is stored for later
movdqa xmm3, xmm1 ; copy so mm1 is stored for later
pmulhw xmm2, fact_YU ; YU factor (U term in adjusted Y)
pmulhw xmm3, fact_YV ; YV factor (V term in adjusted Y)
paddw xmm2, xmm3 ; total adjusted amount to add to Y
movdqa xmm4, [esi]
movdqa xmm3, xmm2 ; make copy
punpcklwd xmm2, xmm2 ; words <1,1,0,0>
punpckhwd xmm3, xmm3 ; words <3,3,2,2>
movdqa xmm5, xmm4 ; make copy of it
punpcklbw xmm4, xmm7 ; 0Y0Y0Y0Y
punpckhbw xmm5, xmm7 ; 0Y0Y0Y0Y
pmullw xmm4, xmm6 ; *64
pmullw xmm5, xmm6 ; *64
paddw xmm4, xmm2 ; add uv adjustment
paddw xmm5, xmm3 ; add uv adjustment
paddw xmm4, oword ptr Q32 ; bump up 32 for rounding
paddw xmm5, oword ptr Q32 ; bump up 32 for rounding
psraw xmm4, 6 ; /64
psraw xmm5, 6 ; /64
packuswb xmm4, xmm5 ; pack back to 8 bytes, saturate to 0-255
movdqa [edi], xmm4
add esi, src_pitchR
add edi, dst_pitchR
movdqa xmm4, [esi]
movdqa xmm5, xmm4 ; make copy of it
punpcklbw xmm4, xmm7 ; 0Y0Y0Y0Y
punpckhbw xmm5, xmm7 ; 0Y0Y0Y0Y
pmullw xmm4, xmm6 ; *64
pmullw xmm5, xmm6 ; *64
paddw xmm4, xmm2 ; add uv adjustment
paddw xmm5, xmm3 ; add uv adjustment
paddw xmm4, oword ptr Q32 ; bump up 32 for rounding
paddw xmm5, oword ptr Q32 ; bump up 32 for rounding
psraw xmm4, 6 ; /64
psraw xmm5, 6 ; /64
packuswb xmm4, xmm5 ; pack back to 8 bytes, saturate to 0-255
movdqa [edi], xmm4
sub esi, src_pitchR ; restore curr line
sub edi, dst_pitchR ; restore curr line
movdqa xmm2, xmm0 ; mov back stored U words
movdqa xmm3, xmm1 ; mov back stored V words
paddw xmm2, xmm2 ; adjust for /2 scale in fact_UU
pmulhw xmm2, fact_UU ; UU factor (U term in adjusted U)
pmulhw xmm3, fact_UV ; UV factor (V term in adjusted U)
psubsw xmm2, xmm3 ; this is new U
movd mm1, eax
paddw xmm2, oword ptr Q8224 ; bias up by 64*128 + 32
psraw xmm2, 6 ; /64
movd eax, mm0
packuswb xmm2, xmm7 ; back to 4 bytes
movq qword ptr[eax], xmm2 ; store adjusted U
movdqa xmm2, xmm0 ; mov back stored U words
movdqa xmm3, xmm1 ; mov back stored V words
movd eax, mm1
paddw xmm3, xmm3 ; adjust for /2 scale in fact_VV
pmulhw xmm2, fact_VU ; VU factor (U term in adjusted V)
pmulhw xmm3, fact_VV ; VV factor (V term in adjusted V)
psubsw xmm3, xmm2 ; 1st term negative, this is new V
paddw xmm3, oword ptr Q8224 ; bias up by 64*128 + 32
psraw xmm3, 6 ; /64
packuswb xmm3, xmm7 ; pack to 4 bytes
movq qword ptr[edx], xmm3 ; store adjusted V
add esi, 16 ; bump ptrs
add edi, 16
add eax, 8
add ecx, 8
paddd mm0, Q8
add edx, 8
dec loopctr ; decrease counter
jnz xloop ; loop
sub height, 2
jz return
mov eax, width_
mov esi, srcpY
shr eax, 4
mov edi, dstpY
mov loopctr, eax
mov eax, srcpU
mov edx, dstpV
mov ecx, srcpV
add esi, src_pitchY2
add edi, dst_pitchY2
add eax, src_pitchUV
add ecx, src_pitchUV
add edx, dst_pitchUV
mov srcpY, esi
mov dstpY, edi
mov dstpV, edx
mov srcpU, eax
mov srcpV, ecx
movd mm1, eax
mov eax, dstpU
add eax, dst_pitchUV
mov dstpU, eax
movd mm0, eax
movd eax, mm1
jmp xloop

return:
ret

conv1_YV12_SSE2 endp


x64:

conv1_YV12_SSE2 proc public frame

src_pitchY2 equ dword ptr [rbp+48]
src_pitchUV equ dword ptr [rbp+56]
dstpY equ qword ptr [rbp+64]
dstpU equ qword ptr [rbp+72]
dstpV equ qword ptr [rbp+80]
dst_pitchR equ dword ptr [rbp+88]
dst_pitchY2 equ dword ptr [rbp+96]
dst_pitchUV equ dword ptr [rbp+104]
width_ equ dword ptr [rbp+112]
height equ dword ptr [rbp+120]
loopctr equ dword ptr [rbp+128]
fact_YU equ oword ptr [rbp+136]
fact_YV equ oword ptr [rbp+144]
fact_UU equ oword ptr [rbp+152]
fact_UV equ oword ptr [rbp+160]
fact_VV equ oword ptr [rbp+168]
fact_VU equ oword ptr [rbp+176]

push rbp
.pushreg rbp
mov rbp,rsp
push rsi
.pushreg rsi
push rdi
.pushreg rdi
push r12
.pushreg r12
push r13
.pushreg r13
push r14
.pushreg r14
push r15
.pushreg r15
sub rsp,96
.allocstack 96
movdqu oword ptr[rsp],xmm6
.savexmm128 xmm6,0
movdqu oword ptr[rsp+16],xmm7
.savexmm128 xmm7,16
movdqu oword ptr[rsp+32],xmm8
.savexmm128 xmm8,32
movdqu oword ptr[rsp+48],xmm9
.savexmm128 xmm9,48
movdqu oword ptr[rsp+64],xmm10
.savexmm128 xmm10,64
movdqu oword ptr[rsp+80],xmm11
.savexmm128 xmm11,80
.endprolog

mov rsi, rcx
mov rdi, dstpY
mov rax, rdx
mov r8, dstpU
mov rcx, r8
mov rdx, dstpV
movsxd r9, r9d
movsxd r10, dst_pitchR
movsxd r11, src_pitchY2
movsxd r12, dst_pitchY2
movsxd r13, src_pitchUV
movsxd r14, dst_pitchUV
mov r15d, width_

; 0x0020, Q32
pcmpeqb xmm8, xmm8
psrlw xmm8, 15
psllw xmm8, 5

; 0x0040, Q64
pcmpeqb xmm9, xmm9
psrlw xmm9, 15
psllw xmm9, 6

; 0x0080, Q128
pcmpeqb xmm10, xmm10
psrlw xmm10, 15
psllw xmm10, 7

; 0x2020, Q8224
pcmpeqb xmm11, xmm11
psrlw xmm11, 15
psllw xmm11, 13
por xmm11, xmm8

pxor xmm7, xmm7
movdqa xmm6, xmm9

xloop:
movdqa xmm0, [rax]
movdqa xmm1, [rcx]
punpcklbw xmm0, xmm7 ; unpack to words, 0U0U0U0U
punpcklbw xmm1, xmm7 ; unpack to words, 0V0V0V0V
psubw xmm0, xmm10 ; adj for 128 chroma offset
psubw xmm1, xmm10 ; adj for 128 chroma offset
pmullw xmm0, xmm6 ; *64, for rounding later
pmullw xmm1, xmm6 ; *64, for rounding later
movdqa xmm2, xmm0 ; copy so xmm0 is stored for later
movdqa xmm3, xmm1 ; copy so xmm1 is stored for later
pmulhw xmm2, fact_YU ; YU factor (U term in adjusted Y)
pmulhw xmm3, fact_YV ; YV factor (V term in adjusted Y)
paddw xmm2, xmm3 ; total adjusted amount to add to Y
movdqa xmm4, [rsi]
movdqa xmm3, xmm2 ; make copy
punpcklwd xmm2, xmm2 ; words <1,1,0,0>
punpckhwd xmm3, xmm3 ; words <3,3,2,2>
movdqa xmm5, xmm4 ; make copy of it
punpcklbw xmm4, xmm7 ; 0Y0Y0Y0Y
punpckhbw xmm5, xmm7 ; 0Y0Y0Y0Y
pmullw xmm4, xmm6 ; *64
pmullw xmm5, xmm6 ; *64
paddw xmm4, xmm2 ; add uv adjustment
paddw xmm5, xmm3 ; add uv adjustment
paddw xmm4, xmm8 ; bump up 32 for rounding
paddw xmm5, xmm8 ; bump up 32 for rounding
psraw xmm4, 6 ; /64
psraw xmm5, 6 ; /64
packuswb xmm4, xmm5 ; pack back to 8 bytes, saturate to 0-255
movdqa [rdi], xmm4
add rsi, r9
add rdi, r10
movdqa xmm4, [rsi]
movdqa xmm5, xmm4 ; make copy of it
punpcklbw xmm4, xmm7 ; 0Y0Y0Y0Y
punpckhbw xmm5, xmm7 ; 0Y0Y0Y0Y
pmullw xmm4, xmm6 ; *64
pmullw xmm5, xmm6 ; *64
paddw xmm4, xmm2 ; add uv adjustment
paddw xmm5, xmm3 ; add uv adjustment
paddw xmm4, xmm8 ; bump up 32 for rounding
paddw xmm5, xmm8 ; bump up 32 for rounding
psraw xmm4, 6 ; /64
psraw xmm5, 6 ; /64
packuswb xmm4, xmm5 ; pack back to 8 bytes, saturate to 0-255
movdqa [rdi], xmm4
sub rsi, r9 ; restore curr line
sub rdi, r12 ; restore curr line
movdqa xmm2, xmm0 ; mov back stored U words
movdqa xmm3, xmm1 ; mov back stored V words
paddw xmm2, xmm2 ; adjust for /2 scale in fact_UU
pmulhw xmm2, fact_UU ; UU factor (U term in adjusted U)
pmulhw xmm3, fact_UV ; UV factor (V term in adjusted U)
psubsw xmm2, xmm3 ; this is new U
paddw xmm2, xmm11 ; bias up by 64*128 + 32
psraw xmm2, 6 ; /64
packuswb xmm2, xmm7 ; pack back to 4 bytes
movdqa [r8], xmm2 ; store adjusted U
movdqa xmm2, xmm0 ; mov back stored U words
movdqa xmm3, xmm1 ; mov back stored V words
paddw xmm3, xmm3 ; adjust for /2 scale in fact_VV
pmulhw xmm2, fact_VU ; VU factor (U term in adjusted V)
pmulhw xmm3, fact_VV ; VV factor (V term in adjusted V)
psubsw xmm3, xmm2 ; 1st term negative, this is new V
paddw xmm3, xmm11 ; bias up by 64*128 + 32
psraw xmm3, 6 ; /64
packuswb xmm3, xmm7 ; pack back to 4 bytes
movdqa [rdx], xmm3 ; store adjusted V
add rsi, 16 ; bump ptrs
add rdi, 16
add rax, 8
add rcx, 8
add r8, 8
add rdx, 8
dec loopctr ; decrease counter
jnz xloop ; loop
sub height, 2
jz return
shr r15d, 4
mov loopctr, r15d
add rsi, r11
add rdi, r12
add rax, r13
add r8, r14
add rcx, r13
add rdx, r14
jmp xloop

return:
add rsp,96
pop r15
pop r14
pop r13
pop r12
pop rdi
pop rsi
pop rbp
ret

conv1_YV12_SSE2 endp

qWord

Use a structure (passed by pointer) for the arguments. This will lead to easily readable and efficient code.
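
One way that could look (a sketch only; the structure and field names are made up for illustration, not taken from your sources):

CONVPARAMS struct
    srcpY       QWORD ?
    srcpU       QWORD ?
    srcpV       QWORD ?
    dstpY       QWORD ?
    dstpU       QWORD ?
    dstpV       QWORD ?
    src_pitchR  DWORD ?
    dst_pitchR  DWORD ?
    width_      DWORD ?
    height      DWORD ?
CONVPARAMS ends

; the caller fills one CONVPARAMS and passes its address in rcx:
;     lea  rcx, params
;     call conv1_YV12_SSE2
conv1_YV12_SSE2 proc
    ; (prolog that saves rsi, rdi, the r12-r15 registers etc. omitted)
    mov    rsi, [rcx].CONVPARAMS.srcpY
    mov    rdi, [rcx].CONVPARAMS.dstpY
    movsxd r10, [rcx].CONVPARAMS.dst_pitchR
    ; ... rest of the routine unchanged ...
    ret
conv1_YV12_SSE2 endp

That way the call needs only one register argument, and the long list of stack-offset equates goes away.
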
MREAL macros - when you need floating point arithmetic while assembling!

Gunther

Hi Elegant,

Quote from: Elegant on July 23, 2015, 04:20:31 AM
I'll post what I'm using right now (mind you, this is a long function)

yes, indeed. Your 32-bit code uses XMM registers on the one hand and MMX registers on the other; your 64-bit code doesn't use MMX at all. That's a difference.

But what exactly is your critical code section? And yes, qWord is right: Organize your data inside a structure and pass a pointer. That'll save a lot of trouble.

Gunther
You have to know the facts before you can distort them.

Elegant

I'll have to learn about structures; that's not something I've tinkered with. I think your suggestion about keeping the qword ptrs in x64 is the answer I was looking for. My ultimate goal (apart from building an x64 project) was to try to use the extra bits in x64 to get a more precise result for the YUV planes, since the x86 version is not as accurate as the C version (values are off by +/- 1).

Gunther

Hi Elegant,

Quote from: Elegant on July 23, 2015, 12:35:30 PM
I'll have to learn about structures; that's not something I've tinkered with. I think your suggestion about keeping the qword ptrs in x64 is the answer I was looking for. My ultimate goal (apart from building an x64 project) was to try to use the extra bits in x64 to get a more precise result for the YUV planes, since the x86 version is not as accurate as the C version (values are off by +/- 1).

if accuracy is what counts, you should use long double values (REAL10) with the good old FPU.
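
A rough sketch of the idea (the labels and values here are just placeholders):

.data
coeff  REAL10 0.587
pixel  REAL10 128.0
result REAL10 0.0

.code
fld   coeff          ; load the 80-bit long double onto the FPU stack
fld   pixel
fmulp st(1), st      ; multiply at full extended precision (64-bit mantissa)
fstp  result         ; store the 80-bit result and pop the stack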

Gunther
You have to know the facts before you can distort them.