x64 rcx register stack pointer?

Started by Elegant, June 28, 2015, 05:30:07 PM


Elegant

So I start with the following:


...
stride equ dword ptr [rbp+64]
...
mov rax,rcx
...


Eventually I fill up all the non-numerical registers (rax, rbx, ..., rbp), and then need to perform the following action:


...
movsxd rax,stride
...


Now the issue: I need to add the value of rax to the original value of rcx (which is now lost). I'm wondering if it's stored somewhere ("shadow space"?) or what the best way of handling that is. I've tried swapping the contents of, say, rax and r11 (push/pop) to hold on to the value, but that didn't yield the correct results. On my x86 version I would just use the parameter name, which I don't have here because the argument arrives in rcx. Any help would be appreciated!

rrr314159

Hi Elegant,

It appears you're missing some basic things; if you already know these, pls excuse.

rcx is not automatically stored in "shadow space": the caller only reserves the slots, and the callee has to write to them itself. You could use it, but ignore that for now.

The basic idea would be to define a place to put it, above all the proc's:

.data?
     rcxStore dq ?
.code

Actually it can be put almost anywhere in the file, but the top is traditional (I assume all your code is in one file).

Then save rcx there:

    mov rcxStore, rcx

and later add it (or whatever u want to do with it):

    add rax, rcxStore

You seem to be trying to keep everything in registers, that's not necessary. Usually keep all important variables in data statements. In fact, give it a meaningful name like "previous_stride" or "hypotenuse" etc, instead of "rcxStore", just as you would in a "real" language. The only important difference: u often have to transfer these data to a register to operate on them, unlike in C++. For instance, to move one named location to another. Instead of

    mov x, y   ; wrong

you must

    mov rax, y
    mov x, rax   ; or push/pop, whatever

There's (almost) no need to differentiate between "non-numeric" registers and r8, r9 etc, they're all the same. If you saved rcx in r11 and later restored it, just as I did above with "rcxStore", that will work also (as long as r11 doesn't get trashed elsewhere). You can also push rcx and pop it later. But the least confusing way is to give it a meaningful name and store it there.
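A minimal sketch of the two alternatives just mentioned (spare register, push/pop). It assumes the fragment sits inside a procedure that has already set up rbp as in the first post; the [rbp+64] offset for stride is taken from that post, and the rest is illustrative only.

```asm
; (a) spare register: valid only while nothing else writes r11
    mov     r11, rcx                  ; save the incoming first argument
    ; ... rcx gets reused here ...
    movsxd  rax, dword ptr [rbp+64]   ; stride, sign-extended to 64 bits
    add     rax, r11                  ; rax = stride + original rcx

; (b) stack: every push needs a matching pop on every path
    push    rcx                       ; save the incoming first argument
    ; ... rcx gets reused here ...
    movsxd  rax, dword ptr [rbp+64]   ; stride again
    pop     r11                       ; original rcx comes back in r11
    add     rax, r11                  ; same result as (a)
```

Variant (b) moves rsp, so it only stays safe if the push and pop are balanced before any ret.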

BTW we're talking qwords here but it works the same with dwords; you just call the registers "eax", "r8d", "r9d" etc.

Quote from: Elegant
On my x86 version I would just use the parameter name that I do not have as it's put into rcx

- As I showed above you can still use a parameter name like "rcxStore" or, better, "Customer_ID" (or whatever), you just have to define it in a .data segment and put rcx there. But you can also use procedures with an argument list, exactly as in x86, then refer to rcx by that name. JWasm takes care of that, but ML doesn't ... I rarely do it that way, don't like behind-the-scenes stack frames etc; but everybody else does.

Hope that helps
I am NaN ;)

Elegant

I forgot about .data? of course... Nonetheless my issue still remains (so that wasn't it, assuming my work is correct). Rather than continuing to fumble through porting this, I'll post my x86 and x64 versions; maybe you can see something about how x64 assembly works that I can't. Funnily enough, it's the same function that gave me problems when extracting the inline assembly in x86. This one function just loves to drive me crazy.

x86 (Works!):

.data
align 16
onesByte qword 2 dup(0101010101010101h)
.code
checkOscillation5_SSE2 proc p2p:dword,p1p:dword,s1p:dword,n1p:dword,n2p:dword,dstp:dword,stride:dword,width_:dword,height:dword,thresh:dword

public checkOscillation5_SSE2

mov eax,p2p
mov ebx,p1p
mov edx,s1p
mov edi,n1p
mov esi,n2p
pxor xmm6,xmm6
dec thresh
movd xmm7,thresh
punpcklbw xmm7,xmm7
punpcklwd xmm7,xmm7
punpckldq xmm7,xmm7
punpcklqdq xmm7,xmm7
yloop:
xor ecx,ecx
align 16
xloop:
movdqa xmm0,[eax+ecx]
movdqa xmm2,[ebx+ecx]
movdqa xmm1,xmm0
movdqa xmm3,xmm2
pminub xmm0,[edx+ecx]
pmaxub xmm1,[edx+ecx]
pminub xmm2,[edi+ecx]
pmaxub xmm3,[edi+ecx]
pminub xmm0,[esi+ecx]
pmaxub xmm1,[esi+ecx]
movdqa xmm4,xmm3
movdqa xmm5,xmm1
psubusb xmm4,xmm2
psubusb xmm5,xmm0
psubusb xmm4,xmm7
psubusb xmm5,xmm7
psubusb xmm2,oword ptr onesByte
psubusb xmm0,oword ptr onesByte
psubusb xmm1,xmm2
psubusb xmm3,xmm0
pcmpeqb xmm1,xmm6
pcmpeqb xmm3,xmm6
pcmpeqb xmm4,xmm6
pcmpeqb xmm5,xmm6
mov eax,dstp
por xmm1,xmm3
pand xmm4,xmm5
pand xmm1,xmm4
movdqa [eax+ecx],xmm1
add ecx,16
mov eax,p2p
cmp ecx,width_
jl xloop
mov eax,stride
add ebx,stride
add p2p,eax
add edx,stride
add edi,stride
add dstp,eax
add esi,stride
mov eax,p2p
dec height
jnz yloop

ret

checkOscillation5_SSE2 endp


x64 (Something is off...):

.code
;checkOscillation5_SSE2 proc p2p:dword,p1p:dword,s1p:dword,n1p:dword,n2p:dword,dstp:dword,stride:dword,width_:dword,height:dword,thresh:dword
; p2p = rcx
; p1p = rdx
; s1p = r8
; n1p = r9

checkOscillation5_SSE2 proc public frame

p2p equ qword ptr [rbp+16]
n2p equ qword ptr [rbp+48]
dstp equ qword ptr [rbp+56]
stride equ dword ptr [rbp+64]
width_ equ dword ptr [rbp+72]
height equ dword ptr [rbp+80]
thresh equ dword ptr [rbp+88]

push rbp
.pushreg rbp
mov rbp,rsp
push rbx
.pushreg rbx
push rsi
.pushreg rsi
push rdi
.pushreg rdi
sub rsp,64
.allocstack 64
movdqu oword ptr[rsp],xmm6
.savexmm128 xmm6,0
movdqu oword ptr[rsp+16],xmm7
.savexmm128 xmm7,16
movdqu oword ptr[rsp+32],xmm8
.savexmm128 xmm8,32
movdqu oword ptr[rsp+48],xmm9
.savexmm128 xmm9,48
.endprolog

mov p2p,rcx
mov rax,p2p
mov rbx,rdx
mov rdx,r8
mov rdi,r9
mov rsi,n2p
pxor xmm6,xmm6
dec thresh
movd xmm7,thresh
punpcklbw xmm7,xmm7
punpcklwd xmm7,xmm7
punpckldq xmm7,xmm7
punpcklqdq xmm7,xmm7
mov r10,16
pcmpeqb xmm9,xmm9
psrlw xmm9,15
movdqa xmm8,xmm9
psllw xmm8,8
por xmm9,xmm8
yloop:
xor rcx,rcx
align 16
xloop:
movdqa xmm0,[rax+rcx]
movdqa xmm2,[rbx+rcx]
movdqa xmm1,xmm0
movdqa xmm3,xmm2
movdqa xmm8,[rdx+rcx]
pminub xmm0,xmm8
pmaxub xmm1,xmm8
movdqa xmm8,[rdi+rcx]
pminub xmm2,xmm8
pmaxub xmm3,xmm8
movdqa xmm8,[rsi+rcx]
pminub xmm0,xmm8
pmaxub xmm1,xmm8
movdqa xmm4,xmm3
movdqa xmm5,xmm1
psubusb xmm4,xmm2
psubusb xmm5,xmm0
psubusb xmm4,xmm7
psubusb xmm5,xmm7
psubusb xmm2,xmm9
psubusb xmm0,xmm9
psubusb xmm1,xmm2
psubusb xmm3,xmm0
pcmpeqb xmm1,xmm6
pcmpeqb xmm3,xmm6
pcmpeqb xmm4,xmm6
pcmpeqb xmm5,xmm6
mov rax,dstp
por xmm1,xmm3
pand xmm4,xmm5
pand xmm1,xmm4
movdqa [rax+rcx],xmm1
add rcx,r10
mov rax,p2p
cmp ecx,width_
jl xloop
movsxd rax,stride ; sign-extend the 32-bit stride once
add rbx,rax ; 64-bit adds: "add ebx,..." would zero the upper half of the pointer
add p2p,rax
add rdx,rax
add rdi,rax
add dstp,rax
add rsi,rax
mov rax,p2p
dec height
jnz yloop

pop rdi
pop rsi
pop rbx
pop rbp

ret

checkOscillation5_SSE2 endp

Yuri

As you write the prologue yourself, it's up to you to store the first 4 parameters. If the fastcall calling convention is used, there will always be space for them on the stack right after the return address (rbp+16 and so on in your code).
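To make that concrete, here is a hedged sketch of a hand-written prologue that stores the four register parameters in their home slots and then reads them back by offset. SumFirstAndFifth is a hypothetical procedure, not part of Elegant's code. The sketch assumes the Microsoft x64 calling convention and leaves out the unwind directives (frame, .pushreg, .endprolog) to stay short, so treat it as an illustration rather than a production prologue.

```asm
.code
; Hypothetical: __int64 SumFirstAndFifth(__int64 a, __int64 b, __int64 c, __int64 d, int e)
SumFirstAndFifth proc
    push    rbp
    mov     rbp, rsp
    ; [rbp] = saved rbp, [rbp+8] = return address, home slots start at [rbp+16]
    mov     [rbp+16], rcx             ; 1st argument
    mov     [rbp+24], rdx             ; 2nd argument
    mov     [rbp+32], r8              ; 3rd argument
    mov     [rbp+40], r9              ; 4th argument
    xor     ecx, ecx                  ; rcx is now free for other use
    movsxd  rax, dword ptr [rbp+48]   ; 5th argument, passed on the stack by the caller
    add     rax, [rbp+16]             ; add the original first argument back in
    pop     rbp
    ret
SumFirstAndFifth endp
end
```

Only the slots that are actually read later need to be written; in the thread's function that is just the one for rcx.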

rrr314159

Elegant, as I mentioned u could use the shadow space to store rcx at rbp+16 as Yuri says. In a way, that's what it's there for. There are other issues of "style" with your code, but none of it amounts to much. As far as I can see there's nothing wrong with it, it ought to work. So I suspect the problem lies elsewhere (as it did last time). BTW Yuri, he didn't need to store rdx, r8 and r9, only rcx.

Same recommendations apply as before: you need a debugger, and could write a test-bed prog. I do have a new suggestion. Use the x86 code directly (almost). You only need to create the refs for n2p, dstp etc and p2p. You don't even need the prologue, unless it's necessary when called from C++. All the other statements can remain as they are, literally, if you use linker option /LARGEADDRESSAWARE:NO. x64 is perfectly happy with 32-bit operations, no need to make them 64-bit. If you can't use that linker option only the addresses like [ecx] need to be changed to [rcx] (etc). That would reduce opportunities for error - although as I say there don't appear to be any errors.

Elegant

Apparently I missed a mov p2p,rcx (I updated the earlier post to use rbp+16 instead of .data?), but the issue still remains. I'll look into setting up a debugger. You mentioned my "style": what would you change to improve it? I'm open to learning more, as I'm relatively new to assembly, especially x64.

rrr314159

Whoops, you're right, p2p wasn't getting initialized. BTW I never use prologues (let JWasm take care of it) so ignored that part, but notice you "sub rsp,64" and don't add it back later? It must not be necessary, which surprises me. If you already know about it, no need to explain.

Re style, it's largely a matter of opinion, but FWIW,

I wouldn't use the registers like that. You put "16" into r10 so you can add it to rcx later on, but actually it's quicker to add 16 directly, as an immediate. Even if it were slower I wouldn't bother. OTOH "stride" is accessed a lot: I would definitely put that in a register. Also dstp - instead of adding into memory in the loop, I'd use a register. In fact I might just put all the variables into registers - you've got 14 to play with!

BTW you can give a register a more meaningful name if u want, like this:

stride equ r11
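A rough sketch of how those suggestions might look in the x64 listing. strideReg is a made-up name (chosen so it doesn't clash with the existing stride equate), and the fragment assumes r11 is not used for anything else in the procedure.

```asm
strideReg textequ <r11>               ; hypothetical alias for a spare volatile register

    movsxd  strideReg, dword ptr [rbp+64]  ; load and sign-extend stride once, before yloop

    ; ... at the bottom of xloop:
    add     rcx, 16                   ; immediate add, so r10 is no longer needed

    ; ... at the bottom of yloop:
    add     rbx, strideReg            ; full 64-bit adds keep the whole pointer intact
    add     rdx, strideReg
    add     rdi, strideReg
    add     rsi, strideReg
```

r11 is a volatile register in the Microsoft x64 convention, so it needs no save and restore in the prologue and epilogue.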

Another minor point, I wouldn't bother to put [rdx+rcx] into xmm8 for just two statements. Speed will not be affected much, if at all, since [rdx+rcx] will be in cache; so it just confuses things.

Re. appearance I would indent the xloop once more (inside yloop) and make it visually obvious where both loops end (jl xloop), just as you have done with the loop beginnings.

And, many people would consider this the most important point: comments!!

But this is all trivial compared to getting it to work ...

Elegant

Just thought I'd update this post so that it does have a happy ending. Debugging showed that I was getting the proper results just before exiting, BUT I forgot to restore xmm6-xmm9 and rsp before returning.


movdqu xmm9,oword ptr[rsp+48]
movdqu xmm8,oword ptr[rsp+32]
movdqu xmm7,oword ptr[rsp+16]
movdqu xmm6,oword ptr[rsp]
add rsp,64
pop rdi
pop rsi
pop rbx
pop rbp

ret


THAT is how it should end, not just popping rdi and onward.