Not much testing, and not enough time spent on the 64-bit technicalities, but I think the code is at least mostly correct and it appears to work as I intended. I used JWasm v2.12pre, WinInc V2.08, and Pelles Linker Version 8.00.1. The attachment includes the source files and EXE, and here are the relevant commands from my SciTE asm.properties so you can see what command lines I used:
command.name.8.*.asm=Jwasm64
command.8.*.asm=c:\jwasm\bin\jwasm -win64 -W3 -c $(FileNameExt)
command.name.10.*.asm=Polink64
command.10.*.asm=c:\program files\pellesc\bin\polink /SUBSYSTEM:CONSOLE /MACHINE:x64 $(FileName).obj
And some typical results running on a Core2-i3 G3220 under Windows7-64:
75 cycles, movsd*250
76 cycles, movsq*125
73 cycles, movsd*250
75 cycles, movsq*125
75 cycles, movsd*250
75 cycles, movsq*125
15 cycles, 32-bit instructons * 10
32 cycles, fpu instructions * 10
5 cycles, 64-bit instructions * 10
14 cycles, 32-bit instructons * 10
32 cycles, fpu instructions * 10
5 cycles, 64-bit instructions * 10
14 cycles, 32-bit instructons * 10
32 cycles, fpu instructions * 10
5 cycles, 64-bit instructions * 10
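To compare runs across machines it can help to convert the counts to bytes per cycle. This is a rough sketch, assuming movsd*250 and movsq*125 mean string moves of 250 dwords and 125 qwords (1000 bytes either way). The helper name is hypothetical and not taken from the attachment.

```python
# Hypothetical helper: convert a cycle count for a block move into bytes per cycle.
# Assumption: "movsd*250" = 250 dword moves, "movsq*125" = 125 qword moves.

def bytes_per_cycle(count: int, element_size: int, cycles: int) -> float:
    """Bytes moved per clock cycle for `count` elements of `element_size` bytes."""
    return count * element_size / cycles

# First pair of results from the listing above.
print(round(bytes_per_cycle(250, 4, 75), 1))  # 13.3
print(round(bytes_per_cycle(125, 8, 76), 1))  # 13.2
```

Both forms move the same 1000 bytes, so near-identical cycle counts mean near-identical throughput.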
Edit: New version in the attachment, with the (not Windows8 compatible) _getch code replaced with a procedure derived from the MASM32 wait_key procedure.
Windows 8.1 x64, i7 3770K
3
3
3
107 cycles, movsd*250
104 cycles, movsq*125
108 cycles, movsd*250
103 cycles, movsq*125
110 cycles, movsd*250
105 cycles, movsq*125
22 cycles, 32-bit instructons * 10
30 cycles, fpu instructions * 10
8 cycles, 64-bit instructions * 10
22 cycles, 32-bit instructons * 10
30 cycles, fpu instructions * 10
7 cycles, 64-bit instructions * 10
21 cycles, 32-bit instructons * 10
30 cycles, fpu instructions * 10
8 cycles, 64-bit instructions * 10
Faulting application name: test.exe, version: 0.0.0.0, time stamp: 0x542a292d
Faulting module name: ntdll.dll, version: 6.3.9600.17114, time stamp: 0x53649e73
Exception code: 0xc0000005
Fault offset: 0x000000000003a027
When did it fault? If it was at the end it could be a problem with _getch. It works under Windows7-64, but:
http://msdn.microsoft.com/en-us/library/078sfkak.aspx
The attachment contains a version that exits after a 5-second delay, with no call to _getch.
The latest one works OK. Here's what WinDbg said about the first one:
(1ce0.1bcc): Access violation - code c0000005 (first chance)
First chance exceptions are reported before any exception handling.
This exception may be expected and handled.
ntdll!RtlDosPathNameToRelativeNtPathName+0x237:
00007ff9`e634a027 0f28442440 movaps xmm0,xmmword ptr [rsp+40h] ss:00000000`0013f978=000000000013f9b80000000002080014
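One thing that stands out in that dump: movaps with a memory operand needs a 16-byte aligned address, and the operand address shown ends in 8. The quick check below only does the arithmetic on the numbers from the WinDbg line. That the cause is a misaligned stack at the call is my guess, since the calling code is only in the attachment.

```python
# Check whether the movaps operand address from the WinDbg line is 16-byte aligned.
# movaps with a memory operand faults if the address is not a multiple of 16.

def misalignment(address: int, boundary: int = 16) -> int:
    """Return how many bytes the address sits past the previous boundary."""
    return address % boundary

operand = 0x0013F978   # ss:00000000`0013f978 from the dump
rsp = operand - 0x40   # the operand was [rsp+40h]

print(hex(rsp), misalignment(rsp))          # 0x13f938 8
print(hex(operand), misalignment(operand))  # 0x13f978 8
```

If that guess is right, rsp was off by 8 when the API was called. The Win64 calling convention expects rsp to be a multiple of 16 at the point of each call instruction.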
Michael,
here are the results for the second program:
3
3
3
147 cycles, movsd*250
99 cycles, movsq*125
104 cycles, movsd*250
99 cycles, movsq*125
104 cycles, movsd*250
99 cycles, movsq*125
20 cycles, 32-bit instructons * 10
28 cycles, fpu instructions * 10
7 cycles, 64-bit instructions * 10
20 cycles, 32-bit instructons * 10
28 cycles, fpu instructions * 10
7 cycles, 64-bit instructions * 10
20 cycles, 32-bit instructons * 10
29 cycles, fpu instructions * 10
7 cycles, 64-bit instructions * 10
Gunther
Windows 8.1 x64 i7-4930K
3
3
3
176 cycles, movsd*250
113 cycles, movsq*125
120 cycles, movsd*250
114 cycles, movsq*125
119 cycles, movsd*250
114 cycles, movsq*125
24 cycles, 32-bit instructons * 10
34 cycles, fpu instructions * 10
9 cycles, 64-bit instructions * 10
23 cycles, 32-bit instructons * 10
33 cycles, fpu instructions * 10
9 cycles, 64-bit instructions * 10
23 cycles, 32-bit instructons * 10
33 cycles, fpu instructions * 10
9 cycles, 64-bit instructions * 10
Michael,
the OS is Windows 7-64. I forgot that information in my last post.
Gunther
Second zip file on Win7 64 bit.
3
3
3
107 cycles, movsd*250
95 cycles, movsq*125
107 cycles, movsd*250
95 cycles, movsq*125
107 cycles, movsd*250
96 cycles, movsq*125
33 cycles, 32-bit instructons * 10
32 cycles, fpu instructions * 10
14 cycles, 64-bit instructions * 10
33 cycles, 32-bit instructons * 10
32 cycles, fpu instructions * 10
14 cycles, 64-bit instructions * 10
33 cycles, 32-bit instructons * 10
32 cycles, fpu instructions * 10
14 cycles, 64-bit instructions * 10