Hi Guys,
I am trying to convert my C/C++ frame rate counter to work in MASM. But I have come across a bit of a snag. I'm not sure how you go about handling 64 bit integers.
; Framerate counter
invoke QueryPerformanceFrequency, addr TimeFrequency
invoke QueryPerformanceCounter, addr TimeEnd
;TimeElapsed.QuadPart = TimeEnd.QuadPart - TimeStart.QuadPart;
;TimeElapsed.QuadPart *= 1000000000;
;TimeElapsed.QuadPart /= TimeFrequency.QuadPart; // in nanoseconds
inc nCounter
;if (TimeElapsed.QuadPart > 1000000000)
.if 1 ; placeholder
invoke itoa, nCounter, addr szBuffer, 10
invoke SetWindowText, hWnd, addr szBuffer
mov nCounter, 0
invoke QueryPerformanceFrequency, addr TimeFrequency
invoke QueryPerformanceCounter, addr TimeStart
.endif
Here is my partially converted code. The commented lines are C/C++
If anyone could assist, that would be truly appreciated 8)
If the number range fits in a DWORD then you probably only need to access the low DWORD of the 64 bit number. I gather this is 32 bit code?
Yep 32 bit code.
Could you please give an example of how to access the low DWORD of the 64 bit number?
Still getting my feet on the ground with the simple stuff. Tried a few different things but they don't compile.
Thanks again :)
I think I am a step closer, but I am on the edge of my knowledge of ASM here - LOL :bgrin:
invoke QueryPerformanceFrequency, addr TimeFrequency
invoke QueryPerformanceCounter, addr TimeEnd
;TimeElapsed.QuadPart = TimeEnd.QuadPart - TimeStart.QuadPart;
mov eax,DWORD PTR TimeEnd[0]
sub eax,DWORD PTR TimeStart[0]
mov ecx,DWORD PTR TimeEnd[+4]
sbb ecx,DWORD PTR TimeStart[+4]
mov DWORD PTR TimeElapsed[0], eax
mov DWORD PTR TimeElapsed[+4], ecx
;TimeElapsed.QuadPart *= 1000000000; // Not sure what to do here
;TimeElapsed.QuadPart /= TimeFrequency.QuadPart; // Not sure what to do here
inc nCounter
;if (TimeElapsed.QuadPart > 1000000000) // Not sure what to do here
.if 1 ; placeholder
invoke itoa, nCounter, addr szBuffer, 10
invoke SetWindowText, hWnd, addr szBuffer
mov nCounter, 0
invoke QueryPerformanceFrequency, addr TimeFrequency
invoke QueryPerformanceCounter, addr TimeStart
.endif
If anyone can assist in helping fill in the blanks, it would be truly awesome.
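For readers who think in C first, here is a minimal sketch of what the sub/sbb pair above does, plus one way to compare a 64 bit value held in two DWORDs against a limit that fits in 32 bits (such as 1000000000). The type and function names are made up for illustration and are not from the thread.

```c
#include <assert.h>
#include <stdint.h>

/* A 64-bit value held as two 32-bit halves, like the two DWORDs of TimeElapsed. */
typedef struct {
    uint32_t lo;  /* bytes 0..3, i.e. DWORD PTR x[0] */
    uint32_t hi;  /* bytes 4..7, i.e. DWORD PTR x[4] */
} u64_halves;     /* hypothetical name */

static_assert(sizeof(u64_halves) == 8, "two DWORDs make one QWORD");

/* a - b using only 32-bit operations: the C analogue of sub / sbb. */
static u64_halves sub_halves(u64_halves a, u64_halves b)
{
    u64_halves r;
    uint32_t borrow = (a.lo < b.lo) ? 1u : 0u; /* the carry flag after "sub" */
    r.lo = a.lo - b.lo;                         /* sub eax, low DWORD  */
    r.hi = a.hi - b.hi - borrow;                /* sbb ecx, high DWORD */
    return r;
}

/* Unsigned "v > limit" where the limit fits in 32 bits. */
static int greater_than_u32(u64_halves v, uint32_t limit)
{
    if (v.hi != 0)        /* any bit in the high DWORD means v >= 2^32 */
        return 1;
    return v.lo > limit;  /* unsigned compare of the low DWORD (ja, not jg) */
}
```

In assembly the second function comes down to testing the high DWORD for zero first and only then doing an unsigned compare on the low DWORD.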
The simple solution:
NanoTimer() (http://www.webalice.it/jj2006/MasmBasicQuickReference.htm#Mb1171)
invoke Sleep, 1000
Inkey NanoTimer$()
But if you want to roll your own, it's good to know that the FPU understands perfectly what a QWORD integer is:
include \masm32\include\masm32rt.inc
.data?
timeStart dq ?
timeEnd dq ?
timeFrequency dq ?
timeElapsed dq ?
.code
start:
invoke QueryPerformanceFrequency, addr timeFrequency
invoke QueryPerformanceCounter, addr timeStart
invoke Sleep, 3000
invoke QueryPerformanceCounter, addr timeEnd
fild timeEnd
fild timeStart
fsub
fild timeFrequency
fdiv
fistp timeElapsed
inkey str$(dword ptr timeElapsed), " seconds elapsed"
exit
end start
I know the feeling...
I copied/pasted some related stuff from my own code, without much checking. But I hope some of it may help.
It may even have some bugs, although it's actually working ok.
.data
SchedulerMS dd 1 ; granularity for Sleep
PerfCountFreq dd 0
LastCounter dd 0
EndCounter dd 0
ElapsedCounter dd 0
tFPS dd 0
MSPerFrameR real8 0.0
SleepMS sdword 0
TargetSecPerFrame real8 16.0
In the code:
QueryPerformance... uses LARGE_INTEGER, which is a union, but you can deal with it straight as a dword:
....
inv QueryPerformanceFrequency, ADDR PerfCountFreq
in the game loop:
...
inv QueryPerformanceCounter, ADDR LastCounter
....
inv QueryPerformanceCounter, ADDR EndCounter
mov ecx, LastCounter
mov eax, EndCounter
sub eax, ecx
mov edx, 1000
mov ElapsedCounter, eax
mul edx
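push eax
fild dword ptr [esp]
add esp, 4 ; release the temporary so the stack stays balanced
fidiv PerfCountFreq
fstp MSPerFrameR
mov eax, PerfCountFreq
xor edx, edx ; div is unsigned, so clear edx instead of sign-extending with cdq
div dword ptr ElapsedCounter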
mov tFPS, eax
m2m LastCounter, EndCounter
This is a piece of timer code I use to calculate the FrameTime delta.
You only need the low 32bit part to calculate the time between screen refreshes.
FramesPerSecond = (1.0 / FrameTimeDelta )
.const
QPinteger struct
Low32bit dd ?
High32bit dd ?
QPinteger ends
float1 real4 1.0
.data?
align 8
FrameTimeOld QPinteger <?>
FrameTimeNew QPinteger <?>
TicksPerSecondReciprocal real4 ?
FrameTimeDelta real4 ?
.code
InitTimer proc
invoke QueryPerformanceCounter,addr FrameTimeOld
invoke QueryPerformanceFrequency,addr FrameTimeNew
movss xmm0,float1
cvtsi2ss xmm1,FrameTimeNew.Low32bit
divss xmm0,xmm1
movss TicksPerSecondReciprocal,xmm0
ret
InitTimer endp
Update_frame proc
invoke QueryPerformanceCounter,addr FrameTimeNew
mov eax,FrameTimeNew.Low32bit
mov ecx,eax
sub eax, FrameTimeOld.Low32bit
mov FrameTimeOld.Low32bit,ecx
cvtsi2ss xmm0,eax
mulss xmm0,TicksPerSecondReciprocal
movss FrameTimeDelta,xmm0 ; FramesPerSecond == 1 / FrameTimeDelta
ret
Update_frame endp
EDIT: code adjustment!
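A small C sketch of why the low 32 bit part is usually enough for frame deltas: unsigned subtraction is modulo 2^32, so the difference stays correct even when the low DWORD wraps between two readings. The function names are hypothetical; the only assumption is that a single frame takes fewer than 2^31 ticks, because cvtsi2ss treats its operand as signed.

```c
#include <assert.h>
#include <stdint.h>

/* Ticks between two readings of the counter's low DWORD.
   Unsigned subtraction wraps modulo 2^32, so this is still correct
   if the low DWORD rolled over between the two readings. */
static uint32_t low_dword_delta(uint32_t new_lo, uint32_t old_lo)
{
    return new_lo - old_lo;
}

/* Seconds for one frame: delta ticks times the precomputed 1/frequency,
   mirroring cvtsi2ss followed by mulss with TicksPerSecondReciprocal. */
static float frame_seconds(uint32_t delta_ticks, float ticks_per_second_reciprocal)
{
    /* cvtsi2ss reads a signed DWORD, so the delta has to stay below 2^31 */
    assert(delta_ticks < 0x80000000u);
    return (float)(int32_t)delta_ticks * ticks_per_second_reciprocal;
}
```

With a counter running at an assumed 10 MHz, 2^31 ticks is roughly 214 seconds, far longer than any single frame.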
Awesome, thanks for the advice. 8)
How would I compare TimeElapsed against 1000000000 to see if it is greater?
I can't use something like the following as it doesn't fit in EAX
mov eax, TimeElapsed
mov ebx, 1000000000
cmp ebx, eax
jg greater
Could you get away with just the low byte or something?
This is what I presently have but the code after the compare never gets executed.
The aim is to display the frame rate at one second intervals in the window title area.
Am I on the right track?
; Framerate counter (Work In Progress)
invoke QueryPerformanceFrequency, addr TimeFrequency
invoke QueryPerformanceCounter, addr TimeEnd
fild TimeEnd
fild TimeStart
fsub
fild TimeFrequency
fdiv
fistp TimeElapsed
inc nCounter
mov eax, DWORD PTR TimeElapsed[0]
mov ebx, 1000000000
cmp ebx, eax
jg skip
; ** Never gets called **
invoke itoa, nCounter, addr szBuffer, 10
invoke SetWindowText, hWnd, addr szBuffer
mov nCounter, 0
invoke QueryPerformanceFrequency, addr TimeFrequency
invoke QueryPerformanceCounter, addr TimeStart
skip:
I adjusted the code in my previous post.
The Update_frame proc would be something like this:
FrameCounter dd 0
TimeCounter real4 0.0
FrameTimeCounter real4 0.0
FramesPerSecond real4 0.0
invoke QueryPerformanceCounter,addr FrameTimeNew
mov eax,FrameTimeNew.Low32bit
mov ecx,eax
sub eax,FrameTimeOld.Low32bit
mov FrameTimeOld.Low32bit,ecx
cvtsi2ss xmm0,eax
mulss xmm0,TicksPerSecondReciprocal
movss FrameTimeDelta,xmm0 ; FPS = 1 / FrameTimeDelta
movss xmm1,TimeCounter
addss xmm1,xmm0
movss TimeCounter,xmm1
inc FrameCounter
movss xmm1,FrameTimeCounter
addss xmm1,xmm0
comiss xmm1,FLT4(1.0)
jb PerSecond
cvtsi2ss xmm0,FrameCounter
divss xmm0,xmm1
movss FramesPerSecond,xmm0 ; update per second
mov FrameCounter,0
xorps xmm1,xmm1
PerSecond:
movss FrameTimeCounter,xmm1
Marinus, is not using the FPU a matter of personal taste, or is there a performance gain? As far as I have read, the FPU still holds up nicely, right?
edit to add: the reason I'm curious is that you sometimes use the FPU too, which got me thinking
Seem to have it working now :icon_cool:
Needed to throw a multiplication of 1000000000 in there.
; Framerate counter (Work In Progress)
invoke QueryPerformanceFrequency, addr TimeFrequency
invoke QueryPerformanceCounter, addr TimeEnd
fild TimeEnd
fild TimeStart
fsub
fild TimeNanoSecond ; 1000000000
fmul
fild TimeFrequency
fdiv
fistp TimeElapsed
inc nCounter
mov eax, DWORD PTR TimeElapsed[0]
mov ebx, 1000000000
cmp eax, ebx
jl skip
invoke itoa, nCounter, addr szBuffer, 10
invoke SetWindowText, hWnd, addr szBuffer
mov nCounter, 0
invoke QueryPerformanceFrequency, addr TimeFrequency
invoke QueryPerformanceCounter, addr TimeStart
skip:
I must be missing some optimisation techniques somewhere as my C++ loop (using the same render code) is 1000 FPS faster than the ASM loop.
C++ render loop is ~7000 FPS
ASM render loop is ~6000 FPS
Not a bad comparison though.
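An integer-only sketch of the same ticks-to-nanoseconds step, written in C for comparison. Multiplying the tick count by 1000000000 directly overflows 64 bits once the count exceeds about 1.8e10 ticks, so this version converts whole seconds and the remainder separately. The function names are hypothetical, and the sketch assumes the frequency stays well below 1.8e10 Hz.

```c
#include <assert.h>
#include <stdint.h>

#define NS_PER_SECOND UINT64_C(1000000000)

/* Convert counter ticks to nanoseconds without overflowing 64 bits.
   Assumes freq is far below 2^64 / 1e9, which holds for realistic counters. */
static uint64_t ticks_to_ns(uint64_t ticks, uint64_t freq)
{
    assert(freq != 0);
    uint64_t whole_seconds = ticks / freq;
    uint64_t remainder     = ticks % freq;   /* always less than freq */
    return whole_seconds * NS_PER_SECOND + remainder * NS_PER_SECOND / freq;
}

/* The one-second check done on the full 64-bit value. */
static int one_second_elapsed(uint64_t elapsed_ns)
{
    return elapsed_ns >= NS_PER_SECOND;
}
```

One thing to keep in mind with the assembly above: the compare only looks at the low DWORD and jl is a signed jump, so an elapsed value of 2^31 ns or more (about 2.1 seconds) would look negative. That is harmless as long as the counter is reset every second.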
One thing I noticed is that (as far as I know) you only need to invoke QueryPerformanceFrequency once, outside and prior to the loop.
You will be receiving the same value all the time.
True. I could take out one of the calls.
But if I take out both (and place a single call prior to the loop) systems that throttle clock speed (the ones that are too smart for their own good) will get incorrect results.
Quote from: Lonewolff on April 12, 2018, 06:50:55 PM
I must be missing some optimisation techniques somewhere as my C++ loop (using the same render code) is 1000 FPS faster than the ASM loop.
Check where the bottleneck is... as far as the timing functions are concerned, they have low overhead, but you could, for example,
- call frequency only once before the loop (it won't change)
- if you use it inside the loop, use QueryPerformanceCounter only once (old end = new start time)
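The two suggestions above can be sketched in C as follows. This is only an illustration with hypothetical names: the frequency is stored once, each frame takes a single counter reading (the old end time becomes the new start time), and the one-second check compares accumulated ticks against the frequency, which avoids the nanosecond conversion altogether. The counter values are passed in as plain numbers so the sketch does not depend on the Windows API.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint64_t freq;        /* ticks per second, read once before the loop */
    uint64_t last;        /* counter value at the previous frame */
    uint64_t accumulated; /* ticks since the last report */
    uint32_t frames;      /* frames since the last report */
} fps_counter;            /* hypothetical helper type */

static void fps_init(fps_counter *c, uint64_t freq, uint64_t now)
{
    assert(freq != 0);
    c->freq = freq;
    c->last = now;
    c->accumulated = 0;
    c->frames = 0;
}

/* Call once per frame with the current counter value.
   Returns the number of frames counted once at least one second
   (freq ticks) has accumulated, otherwise -1. */
static int32_t fps_tick(fps_counter *c, uint64_t now)
{
    c->accumulated += now - c->last;
    c->last = now;                    /* old end = new start */
    c->frames++;
    if (c->accumulated < c->freq)
        return -1;
    int32_t result = (int32_t)c->frames;
    c->accumulated = 0;
    c->frames = 0;
    return result;
}
```

A non-negative return value is the number to format into the window title, in place of nCounter.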
Quote from: Lonewolff on April 12, 2018, 07:03:30 PM
True. I could take out one of the calls.
But if I take out both (and place a single call prior to the loop) systems that throttle clock speed (the ones that are too smart for their own good) will get incorrect results.
That's true too. Have you benchmarked with and without it? I'm away from the computer but got curious.
About the same.
I think it is more the 'DX11 render cycle' that is the bottleneck.
Just had a thought though. The ASM version is built with ML and Link that is supplied with MASM32. But the C++ version is built with the versions supplied with VS2017. I wonder if that is the source of the difference.
Gonna grab something to eat and I'll report back when I build the ASM versions with the 2017 compiler. :icon_cool:
Quote from: LordAdef on April 12, 2018, 06:41:29 PM
Marinus, is not using the FPU a matter of personal taste, or is there a performance gain? As far as I have read, the FPU still holds up nicely, right?
edit to add: the reason I'm curious is that you sometimes use the FPU too, which got me thinking
Hi Alex,
When coding graphics and audio, I mainly use SIMD and not FPU because it can move more data around at greater speed.
When possible I don't mix SIMD and FPU; that's why I used scalar SIMD for the timer code.
Quote from: Lonewolff on April 12, 2018, 06:50:55 PM
I must be missing some optimisation techniques somewhere as my C++ loop (using the same render code) is 1000 FPS faster than the ASM loop.
C++ render loop is ~7000 FPS
ASM render loop is ~6000 FPS
Not a bad comparison though.
Are the message pump loops the same for ASM and C++?
Quote from: Lonewolff on April 12, 2018, 07:03:30 PM
True. I could take out one of the calls.
But if I take out both (and place a single call prior to the loop) systems that throttle clock speed (the ones that are too smart for their own good) will get incorrect results.
Never noticed that ( my system does throttle the clock speed )
Quote from: Siekmanski on April 12, 2018, 07:18:29 PM
Are the message pump loops the same for ASM and C++?
Yep, making sure I keep the code the same so we are comparing apples with apples.
Just changed all of the libs to the 2017 SDK versions and the frame rate is now on par with the C++ version.
Couldn't compile with the 2017 ML.exe as it is complaining about invalid operands on a couple of my calls. Will look a bit closer to see if I am doing something wrong on that front.
Just curious, was it the d3d11.lib ?
Was already using d3d11.lib from the SDK.
Copied the others across - gdi32.Lib, kernel32.Lib, and user32.Lib.
Not sure why the new version of ML.exe doesn't like the project though. Something to do with the coinvoke macro?
Quote
error A2070:invalid instruction operands coinvoke(16): Macro Called From project.asm(228): Main Line Code
error A2070:invalid instruction operands coinvoke(16): Macro Called From project.asm(233): Main Line Code
error A2070:invalid instruction operands coinvoke(16): Macro Called From project.asm(238): Main Line Code
The corresponding lines of code:
(line 228) coinvoke d3dDevice, ID3D11Device, CreateVertexShader, addr vertexShaderData, SIZEOFvertexShaderData, NULL, addr d3dVertexShader
(line 233) coinvoke d3dDevice, ID3D11Device, CreatePixelShader, addr pixelShaderData, SIZEOFpixelShaderData, NULL, addr d3dPixelShader
(line 238) coinvoke d3dDevice, ID3D11Device, CreateInputLayout, addr inputDescP, 1, addr vertexShaderData, SIZEOFvertexShaderData, addr d3dInputLayout
[edit]
Worked it out. The new compiler doesn't like the way I am doing 'sizeof'. I'll work that out another day :biggrin:
ole32.lib perhaps?
(SIZEOF vertexShaderData)
Quote from: Siekmanski on April 12, 2018, 08:10:21 PM
ole32.lib perhaps?
(SIZEOF vertexShaderData)
Yeah 'SIZEOF vertexShaderData' doesn't work because the declaration is multi-line (Shader is hard coded at present)
vertexShaderData db 68,88,66,67,166,109,78,113,107,98,65,70,91,88,250,161,103,22,241,76,1,0,0,0,16,2,0,0,6,0,0,0,56,0,0,0,156,0,0,0,224,0,0,0,92,1,0,0
db 168,1,0,0,220,1,0,0,65,111,110,57,92,0,0,0,92,0,0,0,0,2,254,255,52,0,0,0,40,0,0,0,0,0,36,0,0,0,36,0,0,0,36,0,0,0,36,0,1
db 0,36,0,0,0,0,0,1,2,254,255,31,0,0,2,5,0,0,128,0,0,15,144,4, 0,0,4,0,0,3,192,0,0,255,144,0,0,228,160,0,0,228,144,1,0,0,2,0,0
db 12,192,0,0,228,144,255,255,0,0,83,72,68,82,60,0,0,0,64,0,1,0,15,0,0,0,95,0,0,3,242,16,16,0,0,0,0,0,103,0,0,4,242,32,16,0,0,0,0
db 0,1,0,0,0,54,0,0,5,242,32,16,0,0,0,0,0,70,30,16,0,0,0,0,0,62,0,0,1,83,84,65,84,116,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0
db 2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
db 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
db 0,0,0,0,0,0,82,68,69,70,68,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,28,0,0,0,0,4,254,255,0,1,0,0,28,0,0,0,77,105,99,114,111,115,111
db 102,116,32,40,82,41,32,72,76,83,76,32,83,104,97,100,101,114,32,67,111,109,112,105,108,101,114,32,49,48,46,49,0,73,83,71,78,44,0,0,0,1,0,0,0,8,0,0,0
db 32,0,0,0,0,0,0,0,0,0,0,0,3,0,0,0,0,0,0,0,15,15,0,0,80,79,83,73,84,73,79,78,0,171,171,171,79,83,71,78,44,0,0,0,1,0,0,0,8
db 0,0,0,32,0,0,0,0,0,0,0,1,0,0,0,3,0,0,0,0,0,0,0,15,0,0,0,83,86,95,80,79,83,73,84,73,79,78,0
SIZEOFvertexShaderData EQU $-vertexShaderData
So this is how I am calculating 'sizeof' until I code a better solution.
New compiler doesn't like that very much, where the old one seems ok with it.
Project doesn't link to ole32.lib.
An old problem with recent Micros**t assemblers. Try this:
mov ecx, SIZEOFvertexShaderData ; size of the vertex shader blob passed below
coinvoke d3dDevice, ID3D11Device, CreateVertexShader, addr vertexShaderData, ecx, NULL, addr d3dVertexShader
If that doesn't work:
vertexShaderData db 68 ....
vertexShaderDataEnd db 0
...
mov ecx, offset vertexShaderDataEnd ; address of the end marker, not its contents
sub ecx, offset vertexShaderData ; ecx = size in bytes
coinvoke d3dDevice, ID3D11Device, CreateVertexShader, addr vertexShaderData, ecx, NULL, addr d3dVertexShader
I thought your data was in a structure member; then you can use (sizeof vertexShaderData)
Your solution should work too.
Thanks JJ2007, a couple of things to try.
Nah Siekmanski, my data is all nasty and flapping about at the moment - LOL
-> Project doesn't link to ole32.lib.
CoInitialize and CoUninitialize do need ole32.lib, needed for COM.
The API is simple to use.
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
include \masm32\include\masm32rt.inc
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
comment * -----------------------------------------------------
Build this template with
"CONSOLE ASSEMBLE AND LINK"
----------------------------------------------------- *
.code
start:
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
call main
inkey
exit
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
main proc
LOCAL var1 :DWORD
push esi
push edi
lea esi, var1 ; load the address
mov edi, 100 ; set the counter
@@:
invoke QueryPerformanceCounter, esi ; call the API
print str$([esi]),13,10 ; display low DWORD
sub edi, 1 ; decrement counter
jnz @B ; loop again if not 0
pop edi
pop esi
ret
main endp
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
end start
var1 should be QWORD size
You are right, it should be. In 32 bit code:
LOCAL var1 :DWORD
LOCAL dumm :DWORD
I added it later.
you might use the LARGE_INTEGER structure, as defined in windows.inc
LARGE_INTEGER UNION
STRUCT
LowPart DWORD ?
HighPart DWORD ?
ENDS
QuadPart QWORD ?
LARGE_INTEGER ENDS
handy, because you can access the values as either 2 DWORDs or 1 QWORD
LOCAL liPerfCtr :LARGE_INTEGER
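For comparison, a C sketch of the same idea: one 8-byte object that can be read either as two DWORDs or as one QWORD. The type below is a simplified stand-in, not the real Windows definition, and the low/high split assumes a little-endian machine such as x86.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for LARGE_INTEGER (hypothetical type name). */
typedef union {
    struct {
        uint32_t LowPart;   /* bytes 0..3 on little-endian x86 */
        uint32_t HighPart;  /* bytes 4..7 */
    } parts;
    uint64_t QuadPart;      /* the same 8 bytes as one QWORD */
} large_integer_sketch;

static_assert(sizeof(large_integer_sketch) == 8, "union must be QWORD sized");

/* Read the low DWORD of a 64-bit value through the union. */
static uint32_t low_part(uint64_t q)
{
    large_integer_sketch v;
    v.QuadPart = q;
    return v.parts.LowPart;
}

/* Read the high DWORD of a 64-bit value through the union. */
static uint32_t high_part(uint64_t q)
{
    large_integer_sketch v;
    v.QuadPart = q;
    return v.parts.HighPart;
}
```

Whichever view is used, the variable handed to QueryPerformanceCounter has to be the full 8 bytes.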
Wait,
What Hutch was first doing is what I am doing: simply passing the first dword straight away is already what we want. We don't need any extra work unless we want the full qword.
Or am I missing something?
But I mean,
invoke QueryPerformanceCounter, addr temp
mov eax, temp
that's the dword we need, right? The low dword is the first data in the union, so this should suffice
Most of the time the DWORD is enough; occasionally, you'll get overflow, though.
But what's wrong with the complete solution (http://masm32.com/board/index.php?topic=7060.msg75722#msg75722) that I posted in reply #4? Plain Masm32 8)
Btw if you need milliseconds instead of seconds, just insert a line as shown below, and adjust the unit:
include \masm32\include\masm32rt.inc
.data?
timeStart dq ?
timeEnd dq ?
timeFrequency dq ?
timeElapsed dq ?
.code
start:
invoke QueryPerformanceFrequency, addr timeFrequency
invoke QueryPerformanceCounter, addr timeStart
invoke Sleep, 300
invoke QueryPerformanceCounter, addr timeEnd
fild timeEnd
fild timeStart
fsub
fild timeFrequency
fdiv
fmul FP4(1000.0) ; to get milliseconds instead of seconds
fistp timeElapsed
inkey str$(dword ptr timeElapsed), " ms elapsed"
exit
end start
Quote from: Siekmanski on April 12, 2018, 08:52:41 PM
-> Project doesn't link to ole32.lib.
CoInitialize and CoUninitialize do need ole32.lib, needed for COM.
Weird. Definitely not linking to this yet COM is working perfectly.
Maybe something in d3d11.lib?
I don't call CoInitialize or CoUninitialize anywhere either. Not even in my C++ code. Haven't had to do that since DX9. Maybe DX11 does this under the hood?
[edit]
Arggh! My bad. I am including masm32rt.inc which links to ole32.lib.
I'm just following the rules. :biggrin:
EDIT: I'm not certain now if it is really necessary for DirectX. ::)
I never used CoCreateInstance as far as I can remember in my DirectX code.
Only used it in DirectShow I think.
https://msdn.microsoft.com/en-us/library/windows/desktop/ms678543(v=vs.85).aspx
Guys,
Allow me to keep this topic going for a bit longer.
Since Hutch, Marinus, JJ and I are going through different routes, I wonder how the benchmarks behave (not done it yet). But an interesting one though.
I'm dealing with dwords and doing the stuff in cpu prior to FPU. Marinus is going SIMD with dd, and JJ is full dq.
My approach was:
.data
SchedulerMS dd 1 ; granularity for Sleep
PerfCountFreq dd 0
LastCounter dd 0
EndCounter dd 0
ElapsedCounter dd 0
tFPS dd 0
MSPerFrame real8 0.0
SleepMS sdword 0
TargetSecPerFrame real8 16.0
.code
inv timeBeginPeriod, SchedulerMS
.IF eax != TIMERR_NOERROR
console "ATTENTION: timeBeginPeriod failed!" ; (console is my printf macro)
.ENDIF
inv QueryPerformanceFrequency, ADDR PerfCountFreq
inv QueryPerformanceCounter, ADDR LastCounter
; prog loop starts
;;; code here
inv QueryPerformanceCounter, ADDR EndCounter
mov ecx, LastCounter
mov eax, EndCounter
sub eax, ecx
mov edx, 1000
mov ElapsedCounter, eax
mul edx
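push eax
fild dword ptr [esp]
add esp, 4 ; release the temporary so the stack stays balanced
fidiv PerfCountFreq
fstp MSPerFrame
mov eax, PerfCountFreq
xor edx, edx ; div is unsigned, so clear edx instead of sign-extending with cdq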
div dword ptr ElapsedCounter
mov tFPS, eax
fld TargetSecPerFrame
fsub MSPerFrame
fistp SleepMS
cmp SleepMS, 0
jle done
inv Sleep, dword ptr [SleepMS]
done:
By natural selection, I must be running far behind...but who knows... any comments?
edit to organize the code
Quote
I'm dealing with dwords and doing the stuff in cpu prior to FPU. Marinus is going SIMD with dd, and JJ is full dq.
Yeah, everybody has his own coding style. :badgrin:
PerfCountFreq, LastCounter, EndCounter should be QWORD size.
Now they overwrite each other and ElapsedCounter also.
Maybe better to keep it all in the FPU then you can also make a reciprocal of PerfCountFreq and get rid of the fidiv instruction and replace it with fmul.
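To illustrate the overwrite problem mentioned above, here is a C sketch that models four adjacent dd variables as an array of 32-bit slots and stores an 8-byte value at one of them, the way the API writes a full QWORD to whatever address it receives. The enum and function names are hypothetical, and the byte order assumes little-endian x86.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Four adjacent 32-bit slots standing in for
   PerfCountFreq, LastCounter, EndCounter and ElapsedCounter (all dd). */
enum { SLOT_FREQ, SLOT_LAST, SLOT_END, SLOT_ELAPSED, SLOT_COUNT };

/* Store a 64-bit value starting at a 32-bit slot.
   The second half of the value lands in the following slot. */
static void store_qword_at(uint32_t slots[SLOT_COUNT], int index, uint64_t value)
{
    assert(index >= 0 && index + 1 < SLOT_COUNT); /* stay inside the array */
    memcpy(&slots[index], &value, sizeof value);  /* writes 8 bytes, not 4 */
}
```

Storing to SLOT_FREQ also rewrites SLOT_LAST, storing to SLOT_END also rewrites SLOT_ELAPSED, and so on. Declaring each variable as dq gives every value its own 8 bytes.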
Quote
Maybe better to keep it all in the FPU then you can also make a reciprocal of PerfCountFreq and get rid of the fidiv instruction and replace it with fmul.
This is the Jochen! nice
Quote from: LordAdef on April 13, 2018, 10:26:40 PM
Quote
Maybe better to keep it all in the FPU then you can also make a reciprocal of PerfCountFreq and get rid of the fidiv instruction and replace it with fmul.
This is the Jochen! nice
Or the Marinus if you like SIMD :biggrin:
I presume your goal is to use the timers for your games, am I right?
If you like it I could post an example of my multimedia timers.
It handles TotalTime, TimeElapsed, FramesPerSecond, FrameTimeDelta and 15 additional resettable timers for game events.
But it is written in SIMD. :biggrin:
Quote from: Siekmanski on April 13, 2018, 10:52:32 PM
I presume your goal is to use the timers for your Games am I right?
If you like it I could post an example of my multimedia timers.
It handles TotalTime, TimeElapsed, FramesPerSecond, FrameTimeDelta and 15 additional resettable timers for game events.
But it is written in SIMD. :biggrin:
Yes! And it's in the main loop, so it must be optimized. I would love if you could do that! Thanks Marinus
Multimedia timers in action. :biggrin:
Quote from: Siekmanski on April 12, 2018, 06:33:30 PM
I adjusted the code in my previous post.
The Update_frame proc would be something like this:
FrameCounter dd 0
TimeCounter real4 0.0
FrameTimeCounter real4 0.0
FramesPerSecond real4 0.0
invoke QueryPerformanceCounter,addr FrameTimeNew
mov eax,FrameTimeNew.Low32bit
mov ecx,eax
sub eax,FrameTimeOld.Low32bit
mov FrameTimeOld.Low32bit,ecx
cvtsi2ss xmm0,eax
mulss xmm0,TicksPerSecondReciprocal
movss FrameTimeDelta,xmm0 ; FPS = 1 / FrameTimeDelta
movss xmm1,TimeCounter
addss xmm1,xmm0
movss TimeCounter,xmm1
inc FrameCounter
movss xmm1,FrameTimeCounter
addss xmm1,xmm0
comiss xmm1,FLT4(1.0)
jb PerSecond
cvtsi2ss xmm0,FrameCounter
divss xmm0,xmm1
movss FramesPerSecond,xmm0 ; update per second
mov FrameCounter,0
xorps xmm1,xmm1
PerSecond:
movss FrameTimeCounter,xmm1
Marinus and friends,
UASM is complaining about:
;comiss xmm1, FLT4(1.0)
Main.asm(281) : Error A2273: real or BCD number not allowed
It's the FLT4 macro. Any idea?
Try good ol' Masm32 FP4()
Or use it as a constant.
.const
fp1 real4 1.0
.code
comiss xmm1,fp1
FP4 is a "built-in" UASM macro
Quote from: aw27 on May 16, 2018, 06:17:53 PM
FP4 is a "built-in" UASM macro
No, it's from the Masm32 SDK (\masm32\macros\macros.asm):
; **********************************************************
; function style macros for direct insertion of data types *
; **********************************************************
FP4 MACRO value
LOCAL vname
.data
align 4
vname REAL4 value
.code
EXITM <vname>
ENDM
FP8 MACRO value
LOCAL vname
.data
align 4
vname REAL8 value
.code
EXITM <vname>
ENDM
FP10 MACRO value
LOCAL vname
.data
align 4
vname REAL10 value
.code
EXITM <vname>
ENDM
Usage:
include \masm32\include\masm32rt.inc
.code
start:
int 3
fld FP4(123.456)
exit
end start
Quote
No, it's from the Masm32 SDK (\masm32\macros\macros.asm):
When you have some time download UASM and read the uasm246_ext.pdf
Then try to make a project without using the "include \masm32\include\masm32rt.inc" (if you still remember how to do it of course).
To your surprise you will see that UASM can figure out what FP4 is.
Quote from: aw27 on May 16, 2018, 07:01:27 PM
read the uasm246_ext.pdf
You mean To read the manual? :biggrin:
Yeap, RTFM ("Read The Funtastic Manual") ;)
thanks everyone.
So, how about FLT4, where is this macro from?
You can find it in dx9macros.inc ( part of my direct3d9 sources )
FLT4 MACRO float_number:REQ
LOCAL float_num
.data
align 4
float_num real4 float_number
.code
EXITM <float_num>
ENDM
FLT8 MACRO float_number:REQ
LOCAL float_num
.data
align 8
float_num real8 float_number
.code
EXITM <float_num>
ENDM
Out of curiosity: was it just to have a customized macro name?
Not really. I don't use the masm32rt.inc or the masm32\macros\macros.asm in my sources.
FP4 is not a masm standard. Don't know who came up with this kind of macro first.
I'm using the FLT4 and FLT8 macros for +/- 20 years now and they are properly aligned.
FP4 is aligned, too:
FP4 MACRO value
LOCAL vname
.data
align 4
vname REAL4 value
.code
EXITM <vname>
ENDM
I find this debate a bit academic. Without the Masm32 SDK, nobody would even know that MASM exists, or that writing Windows programs in Assembler is possible 8)
Certainly, this snippet works with UAsm. But it doesn't work with Masm. If, however, you enable the macros.asm line, it assembles with both. So what exactly is the added value of built-in FP? macros in UAsm?
.486 ; create 32 bit code
.model flat, stdcall ; 32 bit memory model
option casemap :none ; case sensitive
include \masm32\include\kernel32.inc
include \masm32\include\msvcrt.inc
; include \masm32\macros\macros.asm ; assembles with MASM and UAsm
includelib \masm32\lib\kernel32.lib
includelib \masm32\lib\msvcrt.lib
.code
txFormat db "A double: %1.15f", 0
start:
mov edi, offset FP8(0.0)
fldpi
fstp REAL8 PTR [edi]
invoke crt_printf, addr txFormat, REAL8 PTR [edi]
invoke ExitProcess, 0
end start
Quote
FP4 is aligned, too:
True, but FP8 isn't.
FP8 MACRO value
LOCAL vname
.data
align 4 <---- shouldn't this be align 8 ?
vname REAL8 value
.code
EXITM <vname>
ENDM
Quote
I find this debate a bit academic. Without the Masm32 SDK, nobody would even know that MASM exists, or that writing Windows programs in Assembler is possible 8)
You are totally right, but the question was: "So, how about FLT4, where is this macro from?" ( reply #59 )
A problem is that we can't include macros.asm without including the others.
It doesn't take much effort to start finding nuisances. For example, in JJ's carefully chosen example it is enough to change the calling convention to PASCAL to break the whole thing.
While in UASM we can simply declare our own prototypes and use the built-in FP4/FP8 macros.
Quote from: Siekmanski on May 17, 2018, 11:09:15 PM
Quote
FP4 is aligned, too:
True, but FP8 isn't.
...
align 4 <---- shouldn't this be align 8 ?
Alignment is overrated IMHO 8)
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
465 cycles for 100 * align8
484 cycles for 100 * align4
487 cycles for 100 * misaligned
500 cycles for 100 * align8
494 cycles for 100 * align4
464 cycles for 100 * misaligned
It seems it is, but can we trust cycle counters on modern PCs? :biggrin:
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)
561 cycles for 100 * align8
566 cycles for 100 * align4
565 cycles for 100 * misaligned
567 cycles for 100 * align8
564 cycles for 100 * align4
559 cycles for 100 * misaligned
571 cycles for 100 * align8
567 cycles for 100 * align4
559 cycles for 100 * misaligned
562 cycles for 100 * align8
562 cycles for 100 * align4
559 cycles for 100 * misaligned
563 cycles for 100 * align8
562 cycles for 100 * align4
564 cycles for 100 * misaligned
12 bytes for align8
12 bytes for align4
12 bytes for misaligned
--- ok ---
Do you also have a UASM/MASM macro to replace the ugly, messy pair
mov eax,immediate integer
movd (x)mm0,eax
with a single MOVD (x)mm0,immediate integer macro?
64-bit and 128-bit versions of such macros would be nice too.
Quote from: daydreamer on May 18, 2018, 03:43:21 AM
Do you also have a UASM/MASM macro to replace the ugly, messy pair
mov eax,immediate integer
movd (x)mm0,eax
with a single MOVD (x)mm0,immediate integer macro?
64-bit and 128-bit versions of such macros would be nice too.
No problem:
include \masm32\MasmBasic\MasmBasic.inc ; download (http://masm32.com/board/index.php?topic=94.0)
movx MACRO xmmArg, immArg
if (opattr immArg) ne 36 ; atImmediate
.err <** needs an immediate arg **>
endif
push immArg
movd xmmArg, dword ptr [esp]
add esp, 4
ENDM
Init
movx xmm2, 12345678h
deb 1, "Result:", x:xmm2
EndOfCode
Doesn't trash eax, and works with ordinary non-MasmBasic code, too.
Always making noise!!
AMD A6-3500 APU with Radeon(tm) HD Graphics (SSE3)
759 cycles for 100 * align8
1628 cycles for 100 * align4
755 cycles for 100 * misaligned
760 cycles for 100 * align8
1642 cycles for 100 * align4
755 cycles for 100 * misaligned
760 cycles for 100 * align8
1629 cycles for 100 * align4
764 cycles for 100 * misaligned
761 cycles for 100 * align8
1629 cycles for 100 * align4
755 cycles for 100 * misaligned
765 cycles for 100 * align8
1642 cycles for 100 * align4
755 cycles for 100 * misaligned
12 bytes for align8
12 bytes for align4
12 bytes for misaligned
--- ok ---
I don't know anything about processor architecture, but I think the AMD FPU is a RISC chip. In RISC processors alignment is apparently critical.
It looks like the assembler aligns qwords to 8 by default. Or not :biggrin: Again, what is happening here?
Win 10 Home, 64 bit 1.6 Ghz
AMD A6-9220e RADEON R4, 5 COMPUTE CORES 2C+3G (SSE4)
962 cycles for 100 * align8
1023 cycles for 100 * align4
939 cycles for 100 * misaligned
1003 cycles for 100 * align8
1020 cycles for 100 * align4
979 cycles for 100 * misaligned
991 cycles for 100 * align8
1047 cycles for 100 * align4
1012 cycles for 100 * misaligned
939 cycles for 100 * align8
1104 cycles for 100 * align4
948 cycles for 100 * misaligned
949 cycles for 100 * align8
1064 cycles for 100 * align4
948 cycles for 100 * misaligned
12 bytes for align8
12 bytes for align4
12 bytes for misaligned
a little while later...
AMD A6-9220e RADEON R4, 5 COMPUTE CORES 2C+3G (SSE4)
877 cycles for 100 * align8
943 cycles for 100 * align4
877 cycles for 100 * misaligned
947 cycles for 100 * align8
943 cycles for 100 * align4
878 cycles for 100 * misaligned
877 cycles for 100 * align8
1042 cycles for 100 * align4
878 cycles for 100 * misaligned
876 cycles for 100 * align8
1035 cycles for 100 * align4
875 cycles for 100 * misaligned
876 cycles for 100 * align8
1064 cycles for 100 * align4
883 cycles for 100 * misaligned
12 bytes for align8
12 bytes for align4
12 bytes for misaligned
this computer doesn't seem to like align 4
HSE's really doesn't like it. :P
These benchmarks are sometimes rather interesting, since we often don't reach any common conclusion.
But aligning is so simple that I don't mind doing it anyway
Quote from: LordAdef on May 18, 2018, 07:49:34 AM
But aligning is so simple that I don't mind doing it anyway
If the code gets any faster with alignment, it makes sense in an innermost loop with a Million iterations. Otherwise it bloats your exe, pollutes the data cache, and thus may slow down the whole program.
Quote from: jj2007 on May 18, 2018, 10:14:54 AM
Quote from: LordAdef on May 18, 2018, 07:49:34 AM
But aligning is so simple that I don't mind doing it anyway
If the code gets any faster with alignment, it makes sense in an innermost loop with a Million iterations. Otherwise it bloats your exe, pollutes the data cache, and thus may slow down the whole program.
Thanks for the macro, jj.
Thanks for a timing test idea:
Align 16 data with SSE code, so you can easily use mulps, divps etc. with variables in memory,
versus the case where you cannot rely on aligned memory data with SIMD, so instead you use lots of movups before the inner loop, and the inner loop uses all 16 xmm regs in 64-bit mode so that every mulps etc. is reg to reg and all variables are kept in xmm regs.
Then run the test several million times.
Note that your testcase is different: fld alignedVariable vs fld unalignedVariable uses identical instructions. When using SIMD instructions that throw exceptions, you need additional instructions, and that may cost cycles, of course. Test it... GetTickCount is your friend ;)
Quote from: jj2007 on May 18, 2018, 09:48:28 PM
Note that your testcase is different: fld alignedVariable vs fld unalignedVariable uses identical instructions. When using SIMD instructions that throw exceptions, you need additional instructions, and that may cost cycles, of course. Test it... GetTickCount is your friend ;)
My C++ exercise forces me to use movups and mulps xmmreg,xmmreg in inline asm, so I might as well try that different solution for the inner loop.