Hi
What speed differences are there between TEST, CMP, an FPU compare, SSE UCOMISS, and SSE2 UCOMISD?
You can count it for each of them yourself:
.486p
.model flat, stdcall
option casemap:none
include C:\masm32\include\kernel32.inc
includelib C:\masm32\lib\kernel32.lib
.data?
PerformanceFrequency dword ?  ; low dword of the 64-bit frequency
                     dword ?  ; high dword
PerformanceCount1    dword ?  ; low dword of the 64-bit start count
                     dword ?
PerformanceCount2    dword ?  ; low dword of the 64-bit end count
                     dword ?
TimeSpent            dword ?  ; result, in microseconds
.code
Start:
invoke QueryPerformanceFrequency, offset PerformanceFrequency ; Ticks per second frequency
invoke QueryPerformanceCounter, offset PerformanceCount1 ; Execution start time
mov ecx, 1000000000 ; one billion iterations
CountTime:
test eax, 5 ; instruction under test
loop CountTime ; note: LOOP itself contributes to the measured time
invoke QueryPerformanceCounter, offset PerformanceCount2 ; Execution end time
mov eax, PerformanceCount2
sub eax, PerformanceCount1 ; ticks elapsed (low dwords only; fine for short runs)
mov edx, 1000000 ; converting ticks to microseconds
mul edx ; EDX:EAX = ticks * 1,000,000
div PerformanceFrequency ; divide by ticks per second (low dword)
mov TimeSpent, eax ; execution time in microseconds
invoke ExitProcess, 0
end Start
I think the OP was asking for timings of these instructions, not code to measure them.
Quote from: NoCforMe on February 11, 2025, 08:53:52 AMI think the OP was asking for timings of these instructions, not code to measure them.
I think that was a hint for daydreamer to time them for himself. I think... :cool:
I, of course, could be mistaken. That's happened once before. :tongue:
Why should you have to write a program to find these timings when they're probably published somewhere already? It's not like they're some kind of Top Sekrit info.
Quote from: NoCforMe on February 11, 2025, 09:11:54 AMWhy should you have to write a program to find these timings when they're probably published somewhere already? It's not like they're some kind of Top Sekrit info.
Because timings can differ wildly from one processor to another, and definitely between AMD and Intel for some instructions. The same holds true for cycle counts, judging by some of the threads in this very board (The Laboratory).
Have a look at masm32\macros\timers.asm
You can count clock cycles and execution time with the macros.
They aren't in the MASM64 package, though I think MichaelW ported them to 64-bit somewhere.
Macros, schmacros.
The code given above by Villuy shows how simple this is:
- Call QueryPerformanceCounter(), stash the start time
- Run the operation you want to time
- Call QueryPerformanceCounter() again, subtract the start time from this time
- Scale the result using the value returned by QueryPerformanceFrequency(), display the result
counter_begin 100000,HIGH_PRIORITY_CLASS
;your code here
counter_end
;EAX has cycle count
:rolleyes:
To answer the OP's question without writing code, you might want to check this document of Agner Fog's (https://www.agner.org/optimize/instruction_tables.pdf), which gives the following info for each instruction for many different processors:
- Ops ("Number of macro-operations issued from instruction decoder to schedulers")
- Latency: ("This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably.")
- Reciprocal throughput ("This is also called issue latency. This value indicates the average number of clock cycles from the execution of an instruction begins to a subsequent independent instruction of the same kind can begin to execute.")
As you can see, determining execution time for instructions is no simple thing. However, these tables at least offer the chance to compare instructions to see which ones may be relatively faster.
Of course, as they say, YMMV.
In fact, if an instruction is needed in a given place, then it is needed; there is no point agonizing over its price. In a broader sense, optimization is so complex and unknowable that there is nothing to rely on except the empirical method for each concrete case.
REAL4 compares against +inf, -inf, and zero can be done with an integer 32-bit CMP, since 0.0 has the same bit pattern as integer 0.
REAL8 compares work similarly, but with a 64-bit integer CMP.
Are these faster?
Is UCOMISS faster than an FPU compare?
Quote from: daydreamer on February 12, 2025, 06:15:21 AMREAL4 compares against +inf, -inf, and zero can be done with an integer 32-bit CMP, since 0.0 has the same bit pattern as integer 0.
REAL8 compares work similarly, but with a 64-bit integer CMP.
Are these faster?
Is UCOMISS faster than an FPU compare?
Look them up in that document I linked in my reply above (https://www.agner.org/optimize/instruction_tables.pdf).
They're all there. See for yourself.
What's happened to this forum? Nobody wants to test-run code anymore? :(
I have a comparison that might not have been tested yet:
a loop with many scalar compares with conditional jumps vs. packed compares???
Well, go ahead and test it then.
You have access to all the tools you need, including macros if you want to go that route.
Report back to us with your results.
Quote from: daydreamer on February 13, 2025, 07:43:34 AMWhat's happened to this forum? Nobody wants to test-run code anymore? :(
I have a comparison that might not have been tested yet:
a loop with many scalar compares with conditional jumps vs. packed compares???
The skeleton code I wrote earlier is all you need, just add the code you want to test.
Quote from: daydreamer on February 13, 2025, 07:43:34 AMWhat's happened to this forum? Nobody wants to test-run code anymore? :(
It's usually the OP who supplies the testing method. This ensures that all testers use the same testbed (i.e., the same testing methods), for better comparison between the processors where the test is conducted, and even to check for differences between OSes etc. running the same function or algorithm.
We look forward to your test results, hopefully soon.
Scalar vs packed SSE compare
// primesx.cpp
#include "pch.h"
#include <iostream>
using namespace std;
float zero = 0.0;
int pflag = 0;
alignas(16) float flut[]{ 2.0,3.0,5.0,7.0,11.0,13.0,17.0,19.0,0.0,0.0,0.0 };
alignas(16) float arr[]{ 0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0 };
char lut[]{ 0, 0, 2, 3, 0, 5, 0, 7, 0, 0,0,11,0,13,0,0,0,0,0,0 };
int main()
{
    int i, j = 3;
    float f = 3.0;
    float fresult = 0;
    cout << "Primesx\n";
    for (i = 0; i < 14; i++) {
        cout << i << " ";
        if (lut[i] != 0) cout << "prime " << (int)lut[i] << " ";
        cout << f << " ";
        //f = f + 1.0;
    }
    cout << f << " ";
    cout << "\n\n\n\n";
    for (j = 1; j < 14; j++) {
        f = (float)j;                   // test all floats
        _asm {
            push ebx
            mov ecx, 8                  ; 8 primes in flut (was 14, which read past the table)
            lea ebx, flut               ; start at flut[0], so prime 2 is found too
        L2:
            movss xmm0, f
            movss xmm1, [ebx]
            ucomiss xmm0, xmm1
            jne L1                      ; not equal: keep searching
            mov eax, 1                  ; equal: f is in the prime table
            mov pflag, eax
            movss xmm0, [ebx]
            movss fresult, xmm0
            jmp L3                      ; found prime, jump out of loop
        L1:
            xorps xmm0, xmm0
            mov eax, 0
            mov pflag, eax              ; not found (yet): clear the flag
            add ebx, 4
            dec ecx
            jne L2
        L3:
            pop ebx
        }
        fresult = fresult * pflag;      // zeroed when no prime was found
        cout << fresult << " ";
    } // j
    cout << "\n";
    for (j = 2; j < 20; j++) {
        f = (float)j;
        _asm {
            push ebx
            lea ebx, flut
            movss xmm0, f
            shufps xmm0, xmm0, 0        ; broadcast f to all four lanes
            movups xmm1, [ebx]          ; flut[0..3]
            movaps xmm3, xmm1
            cmpeqps xmm1, xmm0          ; all-ones mask where a lane equals f
            pand xmm1, xmm3             ; keep the matching prime, zero elsewhere
            movaps xmm7, xmm1
            add ebx, 16
            movups xmm1, [ebx]          ; flut[4..7]
            movaps xmm3, xmm1
            cmpeqps xmm1, xmm0
            pand xmm1, xmm3
            por xmm1, xmm7              ; combine hits from both halves
            movups arr, xmm1
            haddps xmm1, xmm1           ; horizontal add: sum of the four lanes
            haddps xmm1, xmm1
            movss fresult, xmm1
            pop ebx
        }
        cout << "xmm reg float 0,float 1,float 2,float 3 : " << arr[0] << " " << arr[1] << " " << arr[2] << " " << arr[3] << "\n";
        cout << fresult << " zero = non prime\n";
    } // second j loop
}
where is "pch.h" ?
never mind
I don't see any results, only C code.
Quote from: NoCforMe on February 17, 2025, 06:16:19 AMI don't see any results, only C code.
C++ code with an inline assembler block.
C and C++ aren't the same thing.
I don't use C++, as I don't need it, and it isn't universal at all: C++ object files and libraries can't be shared between compilers the way C ones can.