Application Binary Interface (ABI), calling conventions and the like

Started by jj2007, June 14, 2012, 12:24:30 AM

jj2007

Agner Fog on using FPU and xmm regs in Win7-64:
Quote:
However, a public discussion forum quotes the following answers from Microsoft engineers
regarding this issue: "From: Program Manager in Visual C++ Group, Sent: Thursday, May
26, 2005 10:38 AM. It does preserve the state. It's the DDK page that has stale information,
which I've requested it to be changed. Let them know that the OS does preserve state of
x87 and MMX registers on context switches." and "From: Software Engineer in Windows
Kernel Group, Sent: Thursday, May 26, 2005 11:06 AM. For user threads the state of legacy
floating point is preserved at context switch. But it is not true for kernel threads.
Kernel mode drivers can not use legacy floating point instructions." (www.planetamd64.com/index.php?showtopic=3458&st=100).

The issue has finally been resolved with the long overdue publication of a more detailed ABI
for x64 Windows in the form of a document entitled "x64 Software Conventions", well hidden
in the bin directory (not the help directory) of some compiler packages. This document says:
"The MMX and floating-point stack registers (MM0-MM7/ST0-ST7) are preserved across
context switches. There is no explicit calling convention for these registers. The use of
these registers is strictly prohibited in kernel mode code." The same text has later appeared
at the Microsoft website (msdn2.microsoft.com/en-us/library/a32tsf7t(VS.80).aspx).
My tests indicate that these registers are saved correctly during task switches and thread
switches in 64-bit mode, even in an early beta version of x64 Windows.

I like the red part. It somehow implies that the very latest version of the Windows kernel uses, well, "legacy floating point instructions" :biggrin:


qWord

I think they want to speed up context switches in kernel land. Maybe they also want to prevent the use of the slow transcendental functions, to increase the interruptibility of kernel code.
The question is which kind of driver needs FPU stuff? Basic FP-Arithmetic is still available through SSEx.
MREAL macros - when you need floating point arithmetic while assembling!

hutch--

From memory Microsoft abandoned FPU code some time ago for 64 bit versions, over time SSE will probably do the job if they extend the maths to 128 bit. FPU code can still handle numbers in the 80 bit range but it would seem that Intel also want to shift most maths to SSE rather than the now ancient FPU.

qWord

Quote from: hutch-- on June 14, 2012, 12:57:51 AM
From memory Microsoft abandoned FPU code some time ago for 64 bit versions, over time SSE will probably do the job if they extend the maths to 128 bit.
Only kernel code is affected. User mode applications can still use the FPU.

dedndave

Quote from: hutch-- on June 14, 2012, 12:57:51 AM
From memory Microsoft abandoned FPU code some time ago for 64 bit versions, over time SSE will probably do the job if they extend the maths to 128 bit. FPU code can still handle numbers in the 80 bit range but it would seem that Intel also want to shift most maths to SSE rather than the now ancient FPU.

maybe that implies the future intentions of intel (assuming that intel and ms collaborate)
they may intend to phase it out over the next few generations of processors

jj2007

> Kernel mode drivers can not use legacy floating point instructions

They say "don't use them", not: "if you use them, preserve them". Who would be affected by "wrong" FPU values if not Kernel code itself? Or do I misunderstand something completely? Kernel-wise I am a noob...

dedndave

may have something to do with handling FPU exceptions
i thought you were an expert on that stuff   :biggrin:

qWord

For kernel threads, the FPU registers and status are not saved when a context switch occurs. That means that the whole FPU contents can change from one instruction to the next, if a context switch has occurred between them.


Zen

This is mind-boggling,...
As if kernel-mode programming wasn't confusing enough already,...

jj2007

I've done some tests checking how much it costs to save & restore the xmm regs that Win7-64 so mercilessly trashes:

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
89      cycles for fxsave
83      cycles for fxrstor
152     cycles for fsave
113     cycles for frstor

89      cycles for fxsave
83      cycles for fxrstor
152     cycles for fsave
113     cycles for frstor


172 cycles on my puter. Looks like a lot but effectively they are needed only around some probably utterly slow Windows API calls.
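
The 172 cycles presumably come from adding the fxsave and fxrstor timings of the Celeron run above. A minimal C sketch of that arithmetic follows; the constant names and the `pair_cost` and `wrap_overhead` helpers are made up for illustration, and only the four cycle counts come from the listing.

```c
#include <assert.h>

/* Cycle counts copied from the Celeron M 420 listing above. */
enum {
    FXSAVE_CYCLES  = 89,
    FXRSTOR_CYCLES = 83,
    FSAVE_CYCLES   = 152,
    FRSTOR_CYCLES  = 113
};

/* Cost of one save followed by one restore (hypothetical helper). */
int pair_cost(int save_cycles, int restore_cycles)
{
    return save_cycles + restore_cycles;
}

/* Extra cycles spent if `calls` API calls are each wrapped in one
   save/restore pair (hypothetical helper). */
long wrap_overhead(int pair, long calls)
{
    return (long)pair * calls;
}
```

With these numbers `pair_cost(FXSAVE_CYCLES, FXRSTOR_CYCLES)` gives 172 and `pair_cost(FSAVE_CYCLES, FRSTOR_CYCLES)` gives 265, so on this CPU the fx pair is the cheaper way to protect the registers around a call.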

dedndave

prescott w/htt - XP MCE2005 SP3
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
162     cycles for fxsave
243     cycles for fxrstor
530     cycles for fsave
576     cycles for frstor

158     cycles for fxsave
243     cycles for fxrstor
528     cycles for fsave
578     cycles for frstor

MichaelW

P4 Northwood w/ht XP SP3

Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE2)
87      cycles for fxsave
207     cycles for fxrstor
443     cycles for fsave
526     cycles for frstor

84      cycles for fxsave
202     cycles for fxrstor
434     cycles for fsave
529     cycles for frstor


Interesting to see the drop in IPC for the Prescott compared to the Northwood.

Well Microsoft, here's another nice mess you've gotten us into.

hutch--

I had similar timing results back when I had both the Northwood and a Prescott as dev boxes; the 2.8 gig Northwood was usually faster than the 3 gig Prescott and had noticeably less lag. Apparently the Prescott has a much longer pipeline.

jj2007

Getting faster...
Interesting that fsave/frstor is always much slower. The x variants save 512 bytes to memory.

AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
38      cycles for fxsave
73      cycles for fxrstor
166     cycles for fsave
130     cycles for frstor

38      cycles for fxsave
73      cycles for fxrstor
167     cycles for fsave
130     cycles for frstor
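
For reference, the 512 bytes written by fxsave can be sketched as a C struct. This is only an illustration of the documented layout: the type and field names are invented here, and the instruction/data pointer fields are lumped into one byte array because their format depends on the operating mode. The 108-byte figure for fsave assumes the 32-bit protected-mode format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative layout of the FXSAVE/FXRSTOR memory image.
   The buffer passed to the instructions must be 16-byte aligned. */
typedef struct {
    uint16_t fcw;            /* x87 control word */
    uint16_t fsw;            /* x87 status word */
    uint8_t  ftw;            /* abridged tag word */
    uint8_t  reserved0;
    uint16_t fop;            /* last x87 opcode */
    uint8_t  fpu_ip_dp[16];  /* instruction/data pointers, mode dependent */
    uint32_t mxcsr;          /* SSE control and status */
    uint32_t mxcsr_mask;
    uint8_t  st_mm[8][16];   /* ST0-ST7 / MM0-MM7, 10 bytes used per slot */
    uint8_t  xmm[16][16];    /* XMM0-XMM15 (only XMM0-XMM7 in 32-bit mode) */
    uint8_t  reserved1[96];
} fxsave_area;

/* FSAVE image in 32-bit protected mode: 28-byte environment
   plus eight 10-byte registers. */
enum { FSAVE_AREA_BYTES = 28 + 8 * 10 };

_Static_assert(sizeof(fxsave_area) == 512, "FXSAVE image is 512 bytes");
_Static_assert(offsetof(fxsave_area, st_mm) == 32, "x87/MMX registers start at byte 32");
_Static_assert(offsetof(fxsave_area, xmm) == 160, "XMM registers start at byte 160");
```

So the fx pair moves almost five times as much data as fsave/frstor (512 versus 108 bytes) and is still the faster one in every listing above.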