Hi
What speed differences are there between TEST, CMP, an FPU compare, SSE UCOMISS, and SSE2 UCOMISD?
You can count it for each of them yourself:
.486p
.model flat, stdcall
option casemap:none
include C:\masm32\include\kernel32.inc
includelib C:\masm32\lib\kernel32.lib
.data?
PerformanceFrequency dword ?  ; low dword of the 64-bit frequency
                     dword ?  ; high dword
PerformanceCount1    dword ?  ; low dword of the 64-bit start count
                     dword ?
PerformanceCount2    dword ?  ; low dword of the 64-bit end count
                     dword ?
TimeSpent            dword ?  ; result, in microseconds
.code
Start:
invoke QueryPerformanceFrequency, offset PerformanceFrequency ; Ticks per second frequency
invoke QueryPerformanceCounter, offset PerformanceCount1 ; Execution start time
mov ecx, 1000000000 ; one billion iterations
CountTime:
test eax, 5 ; instruction under test
loop CountTime ; note: LOOP itself contributes to the measured time
invoke QueryPerformanceCounter, offset PerformanceCount2 ; Execution end time
mov eax, PerformanceCount2
sub eax, PerformanceCount1 ; ticks elapsed (low dwords only; fine for short runs)
mov edx, 1000000 ; converting ticks to microseconds
mul edx ; EDX:EAX = ticks * 1,000,000
div PerformanceFrequency ; divide by ticks per second (low dword)
mov TimeSpent, eax ; execution time in microseconds
invoke ExitProcess, 0
end Start
I think the OP was asking for timings of these instructions, not code to measure them.
Quote from: NoCforMe on February 11, 2025, 08:53:52 AMI think the OP was asking for timings of these instructions, not code to measure them.
I think that was a hint for daydreamer to time them for himself. I think... :cool:
I, of course, could be mistaken. That's happened once before. :tongue:
Why should you have to write a program to find these timings when they're probably published somewhere already? It's not like they're some kind of Top Sekrit info.
Quote from: NoCforMe on February 11, 2025, 09:11:54 AMWhy should you have to write a program to find these timings when they're probably published somewhere already? It's not like they're some kind of Top Sekrit info.
Because timings can differ wildly from one processor to another, and definitely between AMD and Intel for some instructions. The same holds true for cycle counts, judging by some of the threads in this very board (The Laboratory).
Have a look at masm32\macros\timers.asm
You can count clock cycles and execution time with the macros.
They aren't in the MASM64 package, though I think MichaelW ported them to 64-bit somewhere.
Macros, schmacros.
The code given above by Villuy shows how simple this is:
- Call QueryPerformanceCounter(), stash the start time
- Run the operation you want to time
- Call QueryPerformanceCounter() again, subtract the start time from this time
- Scale the result using the value returned by QueryPerformanceFrequency(), display the result
counter_begin 100000,HIGH_PRIORITY_CLASS
;your code here
counter_end
;EAX has cycle count
:rolleyes:
To answer the OP's question without writing code, you might want to check this document of Agner Fog's (https://www.agner.org/optimize/instruction_tables.pdf), which gives the following info for each instruction for many different processors:
- Ops ("Number of macro-operations issued from instruction decoder to schedulers")
- Latency: ("This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably.")
- Reciprocal throughput ("This is also called issue latency. This value indicates the average number of clock cycles from the execution of an instruction begins to a subsequent independent instruction of the same kind can begin to execute.")
As you can see, determining execution time for instructions is no simple thing. However, these tables at least offer the chance to compare instructions to see which ones may be relatively faster.
Of course, as they say, YMMV.
In fact, if an instruction is needed in a given place, then it is needed; there is no point agonizing over its price. In a broader sense, optimization is so complex and unknowable that there is nothing to rely on except the empirical method for each concrete case.
REAL4 compares against +inf, -inf, and zero can be done with an integer 32-bit CMP, since 0.0 has the same bit pattern as integer 0.
REAL8 compares work similarly, but with a 64-bit integer CMP.
Are these faster?
Is UCOMISS faster than an FPU compare?
Quote from: daydreamer on February 12, 2025, 06:15:21 AMREAL4 compares against +inf, -inf, and zero can be done with an integer 32-bit CMP, since 0.0 has the same bit pattern as integer 0.
REAL8 compares work similarly, but with a 64-bit integer CMP.
Are these faster?
Is UCOMISS faster than an FPU compare?
Look them up in that document I linked in my reply above (https://www.agner.org/optimize/instruction_tables.pdf).
They're all there. See for yourself.
What's happened to this forum? Nobody wants to test-run code anymore? :(
I have a comparison that might not have been tested yet:
a loop with many scalar compares with conditional jumps vs. packed compares???
Well, go ahead and test it then.
You have access to all the tools you need, including macros if you want to go that route.
Report back to us with your results.
Quote from: daydreamer on February 13, 2025, 07:43:34 AMWhat's happened to this forum? Nobody wants to test-run code anymore? :(
I have a comparison that might not have been tested yet:
a loop with many scalar compares with conditional jumps vs. packed compares???
The skeleton code I wrote earlier is all you need, just add the code you want to test.
Quote from: daydreamer on February 13, 2025, 07:43:34 AMWhat's happened to this forum? Nobody wants to test-run code anymore? :(
It's usually the OP who supplies the testing method. This ensures that all testers use the same testbed (i.e., the same testing methods), for better comparison between the processors where the test is conducted, and even to check for differences between OSes etc. running the same function or algorithm.
We look forward to your test results, hopefully soon.
Scalar vs packed SSE compare
// primesx.cpp
#include "pch.h"
#include <iostream>
using namespace std;
float zero = 0.0;
int pflag = 0;
alignas(16) float flut[]{ 2.0,3.0,5.0,7.0,11.0,13.0,17.0,19.0,0.0,0.0,0.0 };
alignas(16) float arr[]{ 0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0 };
char lut[]{ 0, 0, 2, 3, 0, 5, 0, 7, 0, 0,0,11,0,13,0,0,0,0,0,0 };
int main()
{
    int i, j = 3;
    float f = 3.0;
    float fresult = 0;
    cout << "Primesx\n";
    for (i = 0; i < 14; i++) {
        cout << i << " ";
        if (lut[i] != 0) cout << "prime " << (int)lut[i] << " ";
        cout << f << " ";
        //f = f + 1.0;
    }
    cout << f << " ";
    cout << "\n\n\n\n";
    for (j = 1; j < 14; j++) {
        f = (float)j;                   // test all floats
        _asm {
            push ebx
            mov ecx, 8                  ; 8 primes in flut (was 14, which read past the table)
            lea ebx, flut               ; start at flut[0], so prime 2 is found too
        L2:
            movss xmm0, f
            movss xmm1, [ebx]
            ucomiss xmm0, xmm1
            jne L1                      ; not equal: keep searching
            mov eax, 1                  ; equal: f is in the prime table
            mov pflag, eax
            movss xmm0, [ebx]
            movss fresult, xmm0
            jmp L3                      ; found prime, jump out of loop
        L1:
            xorps xmm0, xmm0
            mov eax, 0
            mov pflag, eax              ; not found (yet): clear the flag
            add ebx, 4
            dec ecx
            jne L2
        L3:
            pop ebx
        }
        fresult = fresult * pflag;      // zeroed when no prime was found
        cout << fresult << " ";
    } // j
    cout << "\n";
    for (j = 2; j < 20; j++) {
        f = (float)j;
        _asm {
            push ebx
            lea ebx, flut
            movss xmm0, f
            shufps xmm0, xmm0, 0        ; broadcast f to all four lanes
            movups xmm1, [ebx]          ; flut[0..3]
            movaps xmm3, xmm1
            cmpeqps xmm1, xmm0          ; all-ones mask where a lane equals f
            pand xmm1, xmm3             ; keep the matching prime, zero elsewhere
            movaps xmm7, xmm1
            add ebx, 16
            movups xmm1, [ebx]          ; flut[4..7]
            movaps xmm3, xmm1
            cmpeqps xmm1, xmm0
            pand xmm1, xmm3
            por xmm1, xmm7              ; combine hits from both halves
            movups arr, xmm1
            haddps xmm1, xmm1           ; horizontal add: sum of the four lanes
            haddps xmm1, xmm1
            movss fresult, xmm1
            pop ebx
        }
        cout << "xmm reg float 0,float 1,float 2,float 3 : " << arr[0] << " " << arr[1] << " " << arr[2] << " " << arr[3] << "\n";
        cout << fresult << " zero = non prime\n";
    } // second j loop
}
where is "pch.h" ?
never mind
I don't see any results, only C code.
Quote from: NoCforMe on February 17, 2025, 06:16:19 AMI don't see any results, only C code.
C++ code with an inline assembler block.
C and C++ aren't the same thing.
I don't use C++, as I don't need it, and it isn't universal at all: C++ object files and libraries can't be shared between compilers the way C ones can.