Hi All!
This is a small macro system that simulates pushes and pops. It's adapted from SmplMath's result-storing system. It works in 32 and 64 bit, and can be thread-safe or thread-unsafe. Registers can be 1, 2, 4 or 8 bytes in size.
The idea is to use it together with SmplMath, but only one file from SmplMath is required, and I included it here to make the system independent (it builds with ML or JWasm).
other proc
    local fregTLS()
    conout " eax", tab
    mov eax, 1350
    freg_push eax           ; save eax (1350) on the pseudo stack
    mov eax, 2264           ; overwrite eax
    freg_pop eax            ; restore the saved value
    conout str$(eax),lf,lf  ; prints 1350
    ret
other endp
Howdy, your new console template here.
eax 1350
Press any key to continue...
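For readers who don't use MASM, the mechanism can be modelled in C. This is only a sketch of the idea (a per-thread save area used instead of the hardware stack), not the macros themselves: the names `freg_area`, `freg_top`, `pseudo_push`, `pseudo_pop` and the capacity `FREG_SLOTS` are hypothetical, and C11 `_Thread_local` stands in for whatever TLS mechanism the macros use.

```c
#include <assert.h>
#include <stdint.h>

#define FREG_SLOTS 16  /* hypothetical capacity of the save area */

/* One private save area per thread, so threads never share slots. */
static _Thread_local uint64_t freg_area[FREG_SLOTS];
static _Thread_local int freg_top = 0;

/* Store a value in the next free slot instead of using a real push. */
static void pseudo_push(uint64_t value) {
    assert(freg_top < FREG_SLOTS);
    freg_area[freg_top++] = value;
}

/* Read back the most recently stored value instead of using a real pop. */
static uint64_t pseudo_pop(void) {
    assert(freg_top > 0);
    return freg_area[--freg_top];
}

/* Mirrors the "other" procedure above: save, overwrite, restore. */
int demo_other(void) {
    uint64_t eax = 1350;
    pseudo_push(eax);
    eax = 2264;
    eax = pseudo_pop();
    return (int)eax;
}
```

Calling `demo_other()` returns 1350, matching the console output of the assembly example. The hardware stack pointer is never touched, which is the point for 64-bit code with rsp-based frames.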
Of course, more testing is needed :biggrin:
Regards, HSE.
Hi HSE
Good idea to use TLS for this purpose :thumbsup:
This can solve the problems on 64-bit with rsp-based frames. I think this solution may have some timing drawbacks compared to regular push/pop instructions. Did you do a benchmark?
Biterider
Hi Biterider!
Quote from: Biterider on March 30, 2021, 06:37:39 AM
This can solve the problems on 64-bit with rsp-based frames.
A lot better than some rsp-based frameworks I have seen. If qWord solved storage this way, it can't be that bad :biggrin:
Quote from: Biterider on March 30, 2021, 06:37:39 AM
I think this solution may have some timing drawbacks compared to regular push/pop instructions. Did you do a benchmark?
No benchmark, but in theory moving to/from calculated addresses is always slower than push/pop. The idea is to make it easy to translate 32-bit programs to 64 bit, so it matters that the system also works in 32 bit: you can make sure the 32-bit build works well before building in 64 bit. The dual-target application has to pay some price in 64 bit (after testing, in 32 bit just define freg_xxx macro reg / xxx &reg / endm ).
Regards, HSE.
Hi HSE
It's a brilliant idea. :icon_idea:
I think the penalties are all related to cache misses. Once the TLS is loaded it should perform the same way.
I think there is an alternative that I haven't tested or implemented yet. You can count the number of pseudo-pushes and pseudo-pops, reserve some space on the stack (e.g. a local area) and save the content there. It has the benefit of not being hit by cache misses, and it's thread-safe too.
I think it's worth exploring.
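A rough C model of that alternative, under the assumption that the maximum number of simultaneous pseudo-pushes in a procedure is known in advance: the save area is an ordinary local array, so it lives in the procedure's own stack frame and each thread (and each call) gets its own copy. `MAX_DEPTH` and the function name are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical: at most 2 pseudo-pushes were counted in this procedure. */
enum { MAX_DEPTH = 2 };

int64_t sum_with_local_save_area(int64_t a, int64_t b) {
    int64_t save_area[MAX_DEPTH];  /* reserved once in this call's frame */
    int depth = 0;
    int64_t reg;

    reg = a;
    save_area[depth++] = reg;      /* pseudo-push a */
    reg = b;
    save_area[depth++] = reg;      /* pseudo-push b */

    reg = 0;
    reg += save_area[--depth];     /* pseudo-pop, yields b */
    reg += save_area[--depth];     /* pseudo-pop, yields a */

    assert(depth == 0);            /* pushes and pops are balanced */
    return reg;
}
```

Because the area is part of the frame, no stack-pointer adjustment happens between the prologue and the epilogue.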
Regards, Biterider
Quote from: HSE on March 30, 2021, 08:13:05 AM
A lot better than some rsp-based frame I saw.
Is there any evidence that rsp-based frames are faster and/or shorter?
Hi Biterider!
Quote from: Biterider on March 30, 2021, 06:01:16 PM
It's a brilliant idea. :icon_idea:
I think the penalties are all related to cache misses. Once the TLS is loaded it should perform the same way.
Perhaps you are thinking of something different, and that could be brilliant :biggrin:
Quote from: Biterider on March 30, 2021, 06:01:16 PM
You can count the number of pseudo-pushes and pseudo-pops, reserve some place on the stack (e.g. a local area) and save the content there.
That's what this system does :thumbsup:
Regards, HSE.
Nice :thumbsup:
Pseudo stack would also be nice for SIMD registers
In 32 bit, I wonder about pushes/pops to make more registers usable, versus API calls that take milliseconds inside a loop plus 3 pushes and 3 pops, versus using local variables with inc/dec as loop counters?
Quote from: jj2007 on March 30, 2021, 07:31:35 PM
Quote from: HSE on March 30, 2021, 08:13:05 AM
A lot better than some rsp-based frame I saw.
Is there any evidence that rsp-based frames are faster and/or shorter?
What I saw is the contrary. Rsp-based frameworks are larger and, going by Agner Fog's counts, require more cycles. I have no motivation to run a test (but you can :biggrin: )
Quote from: daydreamer on March 30, 2021, 11:21:50 PM
Pseudo stack would also be nice for SIMD registers
In theory you have enough xmm registers :biggrin: I have not included xmm registers, but perhaps I will (just in case some SIMD fanatic wants to try that).
Quote from: daydreamer on March 30, 2021, 11:21:50 PM
In 32 bit, I wonder about pushes/pops to make more registers usable
That is interesting because you can have 2 different stacks. Rsp-based frameworks cannot do that.
Quote from: HSE on March 30, 2021, 11:33:11 PM
Rsp-based frame is larger and, from Agner Fog count, require more cycles. No motivation to
Even in 64-bit code, all rsp-based moves are one byte longer. So what is the motivation to use rsp-based stack frames? I am curious because I see quite often discussions about them.
48 8B 45 04             | mov rax,qword ptr ss:[rbp+4]   |
48 8B 85 90 01 00 00    | mov rax,qword ptr ss:[rbp+190] |
48 8B 44 24 04          | mov rax,qword ptr ss:[rsp+4]   |
48 8B 84 24 90 01 00 00 | mov rax,qword ptr ss:[rsp+190] |
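The extra byte in the rsp rows is the SIB byte (24), which the x86-64 encoding requires whenever rsp is the base register. A small C sketch that reproduces the four lengths above; the function name is hypothetical and it only covers `mov rax, [rbp/rsp + disp]`:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Length in bytes of `mov rax, qword ptr [base + disp]`, base = rbp or rsp. */
int mov_rax_mem_length(bool base_is_rsp, int32_t disp) {
    int length = 3;              /* REX.W (48) + opcode (8B) + ModRM */

    if (base_is_rsp) {
        length += 1;             /* rsp as base always needs a SIB byte */
        if (disp == 0) {
            return length;       /* [rsp] can omit the displacement */
        }
    }
    /* [rbp] cannot omit the displacement: that ModRM form means
       RIP-relative in 64-bit mode, so rbp always carries at least disp8. */
    length += (disp >= -128 && disp <= 127) ? 1 : 4;
    return length;
}
```

So the one-byte difference holds for every non-zero displacement, and only `[rsp]` with no displacement ties with `[rbp+0]` at 4 bytes.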
Hi
A good reason to stick to rsp frames is x64 exception handling.
In order to be able to unwind the code, the operating system needs to find the procedure frames. For this purpose it uses the rsp register and expects that it will not change within a procedure.
If you don't need exception handling, you can go e.g. with rbp frames or no frames at all.
Regards, Biterider
:biggrin: Sorry for my English, I was thinking of frameworks, not frames.
Rsp-based frameworks use push and pop, but recalculate rsp.
Hi
As I read, when you write your own prologue/epilogue, the following applies:
If you fail to register unwind codes, then the system will assume that you are a lightweight leaf function, which means that it will assume that all nonvolatile registers are unmodified from the calling function, the stack pointer has not been changed from its value at function entry, and that the return address is in the default location. For x64, this means that the return address is at the top of the stack; for RISC, it means that the return address is in the standard return address register.
...and that will lead to unpredictable behavior.
Quote from: nidud on March 31, 2021, 05:24:01 AM
I guess they end up in the department where Raymond Chen works :biggrin:
:biggrin:
Biterider
Quote from: Biterider on March 31, 2021, 03:15:55 AM
A good reason to stick to rsp frames is x64 exception handling.
https://docs.microsoft.com/en-us/cpp/build/exception-handling-x64?view=msvc-160
Quote
Frame register
If nonzero, then the function uses a frame pointer (FP), and this field is the number of the nonvolatile register used as the frame pointer, using the same encoding for the operation info field of UNWIND_CODE nodes.
Frame register offset (scaled)
If the frame register field is nonzero, this field is the scaled offset from RSP that is applied to the FP register when it's established. The actual FP register is set to RSP + 16 * this number, allowing offsets from 0 to 240. This offset permits pointing the FP register into the middle of the local stack allocation for dynamic stack frames, allowing better code density through shorter instructions. (That is, more instructions can use the 8-bit signed offset form.)
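A small C sketch of the arithmetic in that passage, to make the "8-bit signed offset" remark concrete. The formula FP = RSP + 16 * offset and the 0 to 240 range are from the quoted documentation; the function names and the addresses in the usage note are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Frame pointer as established from RSP and the 4-bit scaled offset field. */
uint64_t frame_pointer(uint64_t rsp, unsigned scaled_offset) {
    assert(scaled_offset <= 15);  /* 4-bit field: byte offsets 0..240 */
    return rsp + 16u * scaled_offset;
}

/* Can the local at rsp + local_offset be addressed from fp with a
   signed 8-bit displacement (-128..127)? */
int reachable_with_disp8(uint64_t rsp, uint64_t fp, uint64_t local_offset) {
    int64_t disp = (int64_t)(rsp + local_offset) - (int64_t)fp;
    return disp >= -128 && disp <= 127;
}
```

With a scaled offset of 8 the frame pointer sits 128 bytes above RSP, so locals at RSP+0 through RSP+255 are all within disp8 reach (displacements -128 to +127). With a scaled offset of 0 only RSP+0 through RSP+127 are.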
For timings, see Shadow space in 64-bit programming (http://masm32.com/board/index.php?topic=9227.msg101815#msg101815)
P.S.: If anybody knows what exactly they mean by "middle", please tell me, I am curious. It sounds good to have the full range, and a compiler can surely do it, but I can't see how to do it with current assemblers. Like this maybe?
someproc
Local v1
mov rax, v1[Myoffset]
Hi!
There was a small problem when a register is "assumed".
When "assumed", the register is saved as a dword in 32 bit, or as a qword in 64 bit (JWasm family).
Updated in first post.
Regards.
Hi All!
Added pseudo push/pop of variables (there aren't many in my 32-bit code, but there are some). It's a little trickier because a GPR is needed to move the value (the defaults are eax and R10, but you can use another one):
freg_pushv [xax].SDLL_ITEM.pNextItem, R11
····
freg_pop xax
There is also a not-so-automatic correction for an unbalanced number of pushes/pops. That happens in conditional flow:
freg_push xax
.if [xsi].BibBigMaster.options.TextEdition
    invoke CheckMenuItem, xax, IDM_TEXT_ED, MF_UNCHECKED
    freg_pop xax
    invoke CheckMenuItem, xax, IDM_BLOCK_ED, MF_CHECKED
.else
    invoke CheckMenuItem, xax, IDM_TEXT_ED, MF_CHECKED
    freg_correction +1
    freg_pop xax
    invoke CheckMenuItem, xax, IDM_BLOCK_ED, MF_UNCHECKED
.endif
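Why the correction is needed can be modelled in C, assuming the macros keep an assemble-time slot counter that advances in source order regardless of which branch runs. The thread says the system counts pseudo-pushes and pops; the exact bookkeeping here is my assumption, and all names are hypothetical.

```c
#include <assert.h>

typedef enum { OP_PUSH, OP_POP, OP_CORRECTION } OpKind;
typedef struct { OpKind kind; int amount; } Op;

/* Walk the operations in source order, as an assembler would see them,
   and return the lowest counter value reached (negative = underflow). */
int lowest_depth(const Op *ops, int count) {
    int depth = 0, lowest = 0;
    for (int i = 0; i < count; i++) {
        if (ops[i].kind == OP_PUSH) depth += 1;
        else if (ops[i].kind == OP_POP) depth -= 1;
        else depth += ops[i].amount;  /* manual correction */
        if (depth < lowest) lowest = depth;
    }
    return lowest;
}

/* One push, then a pop in the .if branch and a pop in the .else branch. */
int demo_without_correction(void) {
    const Op ops[] = { {OP_PUSH, 0}, {OP_POP, 0}, {OP_POP, 0} };
    return lowest_depth(ops, 3);
}

/* Same source, with a +1 correction before the pop in the .else branch. */
int demo_with_correction(void) {
    const Op ops[] = { {OP_PUSH, 0}, {OP_POP, 0},
                       {OP_CORRECTION, 1}, {OP_POP, 0} };
    return lowest_depth(ops, 4);
}
```

At run time only one of the two pops executes, but in source order the counter sees both. Without the correction it drops to -1 at the second pop. With the +1 it stays at 0, so both branches restore from the same slot.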
Uploaded in first post.
Regards, HSE.
Added a pseudo peek and a more complete example in the first post.