freg: Pseudo push/pop registers in 64 bits

HSE · March 30, 2021, 06:18:01 AM

Hi All!

This is a small macro system to simulate pushes/pops. It's adapted from SmplMath's result-storing system. It works in 32 and 64 bit and can be thread safe or unsafe. Register sizes can be 1, 2, 4 or 8 bytes.

The idea is to work with SmplMath, but only one file is required from SmplMath, and I included it here to make the system independent (and usable with ML or JWASM).

Code Select

other proc
    local fregTLS()
    
    conout "    eax", tab

    mov eax, 1350
    freg_push eax

    mov eax, 2264

    freg_pop eax
    conout str$(eax),lf,lf

    ret
other endp

Code Select

Howdy, your new console template here.

    eax 1350

Press any key to continue...

Of course, more testing is needed.

Regards, HSE.

Biterider · March 30, 2021, 06:37:39 AM

Hi HSE
Good idea to use TLS for this purpose

This can solve the problems on 64-bit with rsp-based frames. I think this solution may have some timing drawbacks compared to regular push/pop instructions. Did you do a benchmark?

Biterider

HSE · March 30, 2021, 08:13:05 AM

Hi Biterider!

Quote from: Biterider on March 30, 2021, 06:37:39 AM
This can solve the problems on 64-bit with rsp-based frames.

A lot better than some rsp-based frameworks I saw. If qWord solved storage this way, it cannot be so bad.

Quote from: Biterider on March 30, 2021, 06:37:39 AM
I think this solution may have some timing drawbacks compared to regular push/pop instructions. Did you do a benchmark?

No benchmark, but in theory moving to/from calculated addresses is always slower than push/pop. The idea is to translate 32-bit programs to 64 bit easily, so it is useful that the system also works in 32 bit, just to be sure the 32-bit build works well before building in 64 bit. The dual application has to pay some price in 64 bit (after testing, in 32 bit just make freg_xxx macro reg / xxx &reg / endm).
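
The 32-bit shortcut in the parentheses could be sketched as follows. This is only a guess at the intended shape, reusing the macro names freg_push and freg_pop from the first post; each pseudo operation simply expands to the native instruction:

```asm
; Hypothetical 32-bit definitions (not taken from the attachment):
; each pseudo operation expands directly to the real stack instruction.
freg_push macro reg
    push reg
endm

freg_pop macro reg
    pop reg
endm
```

Note that a native push does not accept 8-bit registers, so this shortcut only covers the 2- and 4-byte cases.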

Regards, HSE.

Biterider · March 30, 2021, 06:01:16 PM

Hi HSE
It's a brilliant idea.

I think the penalties are all related to cache misses. Once the TLS is loaded it should perform the same way.

I think there is an alternative that I haven't tested or implemented yet. You can count the number of pseudo-pushes and pseudo-pops, reserve some space on the stack (e.g. a local area) and save the content there. It has the benefit of not getting hit by the cache misses and it's thread safe too.
I think it's worth exploring.
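
A minimal sketch of this counting idea, assuming 64-bit registers only and a hypothetical local array named fregArea; it is not taken from HSE's attachment. An assembly-time counter (freg_depth, also a made-up name) selects the slot, so each pseudo-push becomes a plain mov to a fixed frame offset. It also assumes the assembler or framework builds the frame for the local:

```asm
freg_depth = 0                      ; assembly-time slot counter (hypothetical)

freg_push macro reg
    mov qword ptr fregArea[freg_depth*8], reg   ; store into the next free slot
    freg_depth = freg_depth + 1
endm

freg_pop macro reg
    freg_depth = freg_depth - 1
    mov reg, qword ptr fregArea[freg_depth*8]   ; reload from the last used slot
endm

demo proc
    local fregArea[4]:qword         ; room for 4 pseudo-pushes, lives in the frame

    mov rax, 1350
    freg_push rax                   ; saved in slot 0
    mov rax, 2264
    freg_pop rax                    ; rax is reloaded from slot 0
    ret
demo endp
```

Because the counter follows source order rather than run-time control flow, pushes and pops have to balance along straight-line code; a pseudo-push inside a loop would reuse the same slot on every iteration.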

Regards, Biterider

jj2007 · March 30, 2021, 07:31:35 PM

Quote from: HSE on March 30, 2021, 08:13:05 AM
A lot better than some rsp-based frame I saw.

Is there any evidence that rsp-based frames are faster and/or shorter?

HSE · March 30, 2021, 11:07:29 PM

Hi Biterider!

Quote from: Biterider on March 30, 2021, 06:01:16 PM
It's a brilliant idea.
I think the penalties are all related to cache misses. Once the TLS is loaded it should perform the same way.

Perhaps you are thinking of something different, and that could be brilliant.

Quote from: Biterider on March 30, 2021, 06:01:16 PM
You can count the number of pseudo-pushes and pseudo-pops, reserve some space on the stack (e.g. a local area) and save the content there.

That's what this system does.

Regards, HSE.

daydreamer · March 30, 2021, 11:21:50 PM

Nice

Pseudo stack would also be nice for SIMD registers

In 32 bit, I wonder about using pushes/pops to free up more registers, versus a loop with millisecond-long API calls plus 3 pushes and 3 pops, versus using local variables with inc/dec as loop counters?

HSE · March 30, 2021, 11:33:11 PM

Quote from: jj2007 on March 30, 2021, 07:31:35 PM
Quote from: HSE on March 30, 2021, 08:13:05 AM
A lot better than some rsp-based frame I saw.

Is there any evidence that rsp-based frames are faster and/or shorter?

What I saw is the contrary. Rsp-based frameworks are larger and, from Agner Fog's counts, require more cycles. No motivation to make a test (but you can).

HSE · March 31, 2021, 12:01:46 AM

Quote from: daydreamer on March 30, 2021, 11:21:50 PM
Pseudo stack would also be nice for SIMD registers

In theory you have enough xmm registers.

I have not included xmm registers, but perhaps I will (just in case some SIMD fanatic wants to try that).

Quote from: daydreamer on March 30, 2021, 11:21:50 PM
In 32 bit, I wonder about using pushes/pops to free up more registers

That is interesting because you can have 2 different stacks. Rsp-based frameworks cannot do that.

jj2007 · March 31, 2021, 01:07:47 AM

Quote from: HSE on March 30, 2021, 11:33:11 PM
Rsp-based frame is larger and, from Agner Fog count, require more cycles. No motivation to

Even in 64-bit code, all rsp-based moves are one byte longer. So what is the motivation to use rsp-based stack frames? I am curious because I quite often see discussions about them.

Code Select

48 8B 45 04                   | mov rax,qword ptr ss:[rbp+4]    |
48 8B 85 90 01 00 00          | mov rax,qword ptr ss:[rbp+190]  |
48 8B 44 24 04                | mov rax,qword ptr ss:[rsp+4]    |
48 8B 84 24 90 01 00 00       | mov rax,qword ptr ss:[rsp+190]  |

Biterider · March 31, 2021, 03:15:55 AM

Hi
A good reason to stick to rsp frames is x64 exception handling.
In order to be able to unwind the code, the operating system needs to find the procedure frames. For this purpose it uses the rsp register and expects that it will not change within a procedure.

If you don't need exception handling, you can go e.g. with rbp frames or no frames at all.

Regards, Biterider

HSE · March 31, 2021, 03:50:11 AM

Sorry for my English, I was thinking of frameworks, not frames.

Rsp-based frameworks use push and pop, but recalculate rsp.

nidud · March 31, 2021, 03:53:52 AM

deleted

nidud · March 31, 2021, 05:24:01 AM

deleted

Biterider · March 31, 2021, 07:22:24 AM

Hi
From what I have read, when you write your own prologue/epilogue, the following applies:

If you fail to register unwind codes, then the system will assume that you are a lightweight leaf function, which means that it will assume that all nonvolatile registers are unmodified from the calling function, the stack pointer has not been changed from its value at function entry, and that the return address is in the default location. For x64, this means that the return address is at the top of the stack; for RISC, it means that the return address is in the standard return address register.

...and that will lead to unpredictable behavior.
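
For comparison, a prologue that does register its unwind codes might look like this in ml64 syntax. It is only a sketch, assuming a non-leaf function that saves rbx and reserves 20h bytes of shadow space; the procedure name sample is made up:

```asm
.code
sample proc frame
    push rbx            ; save a nonvolatile register
    .pushreg rbx        ; unwind code: rbx was pushed
    sub rsp, 20h        ; reserve shadow space for callees
    .allocstack 20h     ; unwind code: 20h bytes were allocated
    .endprolog          ; prologue ends, no further unwind codes

    ; ... body: rsp is expected to stay fixed from here on ...

    add rsp, 20h        ; epilogue mirrors the prologue
    pop rbx
    ret
sample endp
```

With rsp fixed in the body, saving a register there has to go through something like a pseudo push rather than a real push.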

Quote from: nidud on March 31, 2021, 05:24:01 AM
I guess they end up in the department where Raymond Chen works

Biterider

The MASM Forum
