Just for fun, a speed test dynamic (LoadLibrary+GetProcAddress) vs static linking (*.lib).
The guinea pig is GetTickCount itself, simply because it's the only sufficiently short and fast API around.
The test uses
a) ordinary MasmBasic with static linking
b) the dual assembly variant JBasic, which uses a stub that loads all addresses at runtime.
The only real difference is that the static version uses a jump table, while the dynamic version calls GetTickCount directly. The difference is remarkable, though: 546 to 469 ticks.
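A rough sketch in C of the two call shapes, assuming the usual import-library layout in which a call lands on a small stub that jumps through a table slot filled in by the loader. All names (`real_function`, `import_slot`, `stub`, `resolved`) are hypothetical stand-ins for illustration, not the actual MasmBasic or JBasic code:

```c
/* Stand-in for an API such as GetTickCount (hypothetical). */
static unsigned real_function(void) { return 42u; }

/* Static linking via import library: the loader fills a table slot,
   and the program calls a stub that jumps through that slot. */
static unsigned (*import_slot)(void) = real_function;
static unsigned stub(void) { return import_slot(); }   /* the extra hop */

/* Dynamic linking: the address obtained at runtime is called directly. */
static unsigned (*resolved)(void) = real_function;

unsigned call_static(void)  { return stub(); }      /* call stub -> jmp [slot] */
unsigned call_dynamic(void) { return resolved(); }  /* call [pointer]          */
```

Both paths reach the same function; the static path just takes one more jump. An optimising C compiler may inline the stub, so this only illustrates the shape and is not a benchmark.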
Results:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz
Assembled with ML in 32-bit format using MasmBasic
100000000 iterations
562 ticks
546 ticks
530 ticks
562 ticks
546 ticks
577 ticks
546 ticks
546 ticks
546 ticks
546 ticks
Assembled with ML in 32-bit format using JBasic
100000000 iterations
484 ticks
483 ticks
468 ticks
469 ticks
483 ticks
468 ticks
468 ticks
484 ticks
483 ticks
468 ticks
For comparison, the 64-bit version:
Assembled with ml64 in 64-bit format using JBasic
100000000 iterations
343 ticks
328 ticks
343 ticks
343 ticks
343 ticks
344 ticks
343 ticks
343 ticks
343 ticks
343 ticks
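To put the tick counts into perspective: GetTickCount returns milliseconds, so dividing by the 100000000 iterations gives the average cost per call, loop overhead included. The helper name `ns_per_call` is made up for this sketch:

```c
/* Average cost of one call in nanoseconds, given the elapsed
   GetTickCount ticks (milliseconds) and the number of iterations. */
double ns_per_call(unsigned ticks_ms, unsigned long iterations)
{
    /* 1 ms = 1e6 ns */
    return (double)ticks_ms * 1e6 / (double)iterations;
}
```

With the typical values above, 546 ticks are about 5.46 ns per call, 469 ticks about 4.69 ns, and 343 ticks about 3.43 ns, so the static/dynamic gap is well under one nanosecond per call.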
P.S.: Just in case you feel confirmed that 64-bit code is faster...:
000007FEFCED1120 | 8B 0C 25 04 00 FE 7F | mov ecx, dword ptr ds:[7FFE0004] |
000007FEFCED1127 | 48 8B 04 25 20 03 FE 7F | mov rax, qword ptr ds:[7FFE0320] |
000007FEFCED112F | 48 0F AF C1 | imul rax, rcx |
000007FEFCED1133 | 48 C1 E8 18 | shr rax, 18 |
000007FEFCED1137 | C3 | ret |
GetTickCount /$ /EB 02 jmp short 768B8FD8
768B8FD6 |> |F3: prefix rep:
768B8FD7 |. |90 nop
768B8FD8 |> \8B0D 2403FE7F mov ecx, [7FFE0324]
768B8FDE |. 8B15 2003FE7F mov edx, [7FFE0320]
768B8FE4 |. A1 2803FE7F mov eax, [7FFE0328]
768B8FE9 |. 3BC8 cmp ecx, eax
768B8FEB |.^ 75 E9 jnz short 768B8FD6
768B8FED |. A1 0400FE7F mov eax, [7FFE0004]
768B8FF2 |. F7E2 mul edx
768B8FF4 |. C1E1 08 shl ecx, 8
768B8FF7 |. 0FAF0D 0400FE7F imul ecx, [7FFE0004]
768B8FFE |. 0FACD0 18 shrd eax, edx, 18
768B9002 |. C1EA 18 shr edx, 18
768B9005 |. 03C1 add eax, ecx
768B9007 \. C3 retn
The 32-bit version is much longer; it's a miracle that it doesn't take twice as long :bgrin:
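Read as C, both listings appear to compute the same thing: a raw 64-bit tick value times a 32-bit multiplier, shifted right by 24 bits. The sketch below assumes that the value at 7FFE0004 is the multiplier and that 7FFE0320/7FFE0324 hold the low and high halves of the raw counter; the function and parameter names are hypothetical, and the retry loop (`cmp ecx, eax` / `jnz`) that the 32-bit code uses to read both halves consistently is left out:

```c
#include <stdint.h>

/* 64-bit listing: one 64-bit multiply, then shift right by 24 (0x18). */
uint32_t ticks_64bit_style(uint64_t raw_ticks, uint32_t multiplier)
{
    return (uint32_t)((raw_ticks * multiplier) >> 24);
}

/* 32-bit listing: the same result assembled from 32-bit pieces. */
uint32_t ticks_32bit_style(uint32_t low, uint32_t high, uint32_t multiplier)
{
    uint64_t product   = (uint64_t)low * multiplier;   /* mul edx           */
    uint32_t low_part  = (uint32_t)(product >> 24);    /* shrd eax, edx, 18 */
    uint32_t high_part = (high << 8) * multiplier;     /* shl ecx, 8 / imul */
    return low_part + high_part;                       /* add eax, ecx      */
}
```

The high half is multiplied after a left shift by 8 because (high * 2^32 * multiplier) >> 24 equals (high << 8) * multiplier, so the extra instructions are the price of doing 64-bit arithmetic in 32-bit registers.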
Nice work and thanks! :icon14:
So, the code is the GetTickCount function from the API, in both its 32-bit and 64-bit versions? :redface:
Quote from: felipe on April 10, 2017, 02:04:57 AM
Nice work and thanks! :icon14:
So, the code is the GetTickCount function from the API, in both its 32-bit and 64-bit versions? :redface:
Yes, as you can see from the addresses, e.g. 000007FEFCED1120 - that's 64 bits wide and in the kernel.
That's great, thanks a lot! :biggrin:
Thanks for the test, JJ. That's what I thought. Static linking is faster than using it dynamically.
Results:
Static
AMD Ryzen 5 2400G with Radeon Vega Graphics
Assembled with ML in 32-bit format using MasmBasic
100000000 iterations
390 ticks
375 ticks
406 ticks
375 ticks
375 ticks
360 ticks
375 ticks
375 ticks
390 ticks
391 ticks
Dynamic
Assembled with ML in 32-bit format using JBasic
100000000 iterations
438 ticks
453 ticks
438 ticks
437 ticks
422 ticks
469 ticks
437 ticks
422 ticks
438 ticks
421 ticks
From the import section, the OS creates the IAT for imported functions.
So use a proper import library for DLLs and stop wasting the program user's time with so-called clever code that does its own dynamic DLL loading. That is only useful when a DLL is loaded only when it is needed, as with delayed linking.
Quote from: TimoVJL on April 03, 2025, 07:53:28 AM
So use proper import library for dlls and stop wasting program user time with so called clever code with own dynamic dll loading...
If done properly, it is transparent to the program user. Why would they care?
And it works. What is wrong with using this method?
qeditor.exe (The editor for masm32 SDK) has been using this method to load user plugins since at least "9 October, 2008" when the qeditor.chm help file (in the masm32 SDK) for qeditor version 4 was written - with no issues.
The names of the user plugins of course were not yet known during the assembly of the program, but supplied to qeditor in "menus.ini" by the user and loaded by qeditor using LoadLibrary and GetProcAddress. :smiley:
This topic is 8 years old btw, nothing new. :wink2:
Also, this test is posted in the Laboratory to test the speed difference, not as a recommendation for every day usage. The first line in the topic states "Just for fun".
You have waited 8 years to share your opinion? :joking: You must have thought that this topic was created just today. :smiley:
Quote from: TimoVJL on April 03, 2025, 07:53:28 AM
In import section OS create IAT for imported functions.
So use proper import library for dlls and stop wasting program user time with so called clever code with own dynamic dll loading, that is useful, when dll is only used, when they are needed, like delayed linking.
Me ??? I rarely use dynamic dlls (with GetProcAddress etc). I tested it now due to the other thread we were talking about here (https://masm32.com/board/index.php?topic=12674.msg137802#msg137802)
I don't think he was addressing you, guga.
You were just submitting your test results - although, a bit late it seems. :biggrin:
Here's mine. :tongue: Better late than never. :biggrin:
Dynamic 32 bit
Assembled with ML in 32-bit format using JBasic
100000000 iterations
265 ticks
266 ticks
266 ticks
265 ticks
266 ticks
266 ticks
265 ticks
266 ticks
265 ticks
266 ticks
Static 32 bit
Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz
Assembled with ML in 32-bit format using MasmBasic
100000000 iterations
282 ticks
265 ticks
266 ticks
281 ticks
281 ticks
266 ticks
266 ticks
281 ticks
281 ticks
266 ticks
Dynamic 64 bit
Assembled with HJWasm32 in 64-bit format using JBasic
100000000 iterations
125 ticks
125 ticks
125 ticks
110 ticks
125 ticks
125 ticks
109 ticks
125 ticks
156 ticks
125 ticks
Both 32 bit versions about the same. 64 bit slightly more than twice as fast for me. :biggrin:
I could not run the three programs one after another using a batch file, btw jj. I had to run them separately. They did not return to run the next one.
Quote from: guga on April 03, 2025, 01:38:46 AM
That's what I thought. Static linking is faster than using it dynamically
My tests said the opposite :cool:
In real life, results should be identical because you end up using identical kernel etc functions.
Quote from: jj2007 on April 04, 2025, 08:33:13 AM
Quote from: guga on April 03, 2025, 01:38:46 AM
That's what I thought. Static linking is faster than using it dynamically
My tests said the opposite :cool:
In real life, results should be identical because you end up using identical kernel etc functions.
Indeed, your results are different from mine. I wonder why that happens. Perhaps AMD handles those a bit differently?
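Test code in C
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
#include <stdio.h>

/* typed pointer, so the dynamic call returns a DWORD instead of FARPROC's INT_PTR */
typedef DWORD (WINAPI *PFN_GETTICKCOUNT)(void);

int __cdecl main(void)
{
    int i;
    DWORD dw1, dw2 = 0;
    HMODULE hMod;
    PFN_GETTICKCOUNT pGetTickCount;

    /* static: call through the import library */
    dw1 = GetTickCount();
    for (i = 0; i < 100000000; i++)
        dw2 = GetTickCount();
    printf("%lu ticks static\n", (unsigned long)(dw2 - dw1));

    /* dynamic: resolve the address at runtime */
    hMod = LoadLibraryA("kernel32.dll");
    pGetTickCount = hMod ? (PFN_GETTICKCOUNT)GetProcAddress(hMod, "GetTickCount") : NULL;
    if (!pGetTickCount)
        return 1;
    dw1 = GetTickCount();
    for (i = 0; i < 100000000; i++)
        dw2 = pGetTickCount();
    printf("%lu ticks dynamic\n", (unsigned long)(dw2 - dw1));
    return 0;
}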
32-bit
640 ticks static
562 ticks dynamic
64-bit
499 ticks static
422 ticks dynamic
516 ticks static
531 ticks dynamic
532 ticks static
531 ticks dynamic
547 ticks static
531 ticks dynamic
---- more iterations: ----
5312 ticks static
5422 ticks dynamic
5328 ticks static
5344 ticks dynamic
5438 ticks static
5312 ticks dynamic
5328 ticks static
5312 ticks dynamic
Quote from: TimoVJL on April 04, 2025, 07:28:26 PM
Test code in C
32 bit:
480 ticks static
492 ticks dynamic
64 bit:
322 ticks static
317 ticks dynamic
Quote from: TimoVJL on April 03, 2025, 07:53:28 AM
So use proper import library for dlls and stop wasting program user time with so called clever code with own dynamic dll loading
irony? :biggrin:
Quote from: zedd151 on April 04, 2025, 10:41:43 PM
irony? :biggrin:
But at least I actually test things.
Earlier I was thinking about the additional time to get function addresses from dll, LoadLibrary / GetProcAddress.
Those calls are often useful, for example to avoid requiring the DLL to be present when the program starts.
:smiley: No FreeLibrary call?
That should be considered too, if calculating the extra time spent using this method, imo.
No need, if you know how things work.
FreeLibrary() is only needed if you want to free some resources because a DLL isn't used anymore.
Quote from: TimoVJL on April 05, 2025, 12:17:00 AM
No need, as if someone know how things goes.
FreeLibrary() is only needed, if want to free some resources, if that a dll doesn't used anymore.
I thought it was worth a mention, in any case. :smiley:
Quote from: TimoVJL on April 04, 2025, 11:11:02 PM
additional time to get function addresses from dll, LoadLibrary / GetProcAddress
I timed that for my Windows GUI template: about 0.6 milliseconds, once at program start :cool:
Re FreeLibrary: only necessary if you need to free resources but keep the program running - a rather exotic requirement. A simple ExitProcess takes care of libraries.
Quote from: jj2007 on April 05, 2025, 02:58:40 AM
Re FreeLibrary: only necessary if you need to free resources but keep the program running - a rather exotic requirement. A simple ExitProcess takes care of libraries.
Oh. That, I did not know. I thought that any libraries opened with LoadLibrary needed to be freed with FreeLibrary explicitly, before exiting. :toothy: It was only in special cases that I had ever used it. I.e., temporarily loading a plugin for one of my editors for example.
Raymond Chen on ExitProcess (https://masm32.com/board/index.php?topic=6053.0)
When I coded Java, I tried the Java Native Interface: I loaded a DLL and called C code. The speed improvement was worth it, much faster than the same Java code.
Pelles C 13 project
results are now after changing test order
TestDynStatic1:
1328 ticks dynamic
1328 ticks implib
TestDynStatic164:
672 ticks dynamic
657 ticks implib
for TestDynStatic1_workspace.zip
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
But then you changed the zip file....
TestDynStatic1.exe
1203 ticks dynamic
1328 ticks implib
1188 ticks dynamic
1328 ticks implib
1187 ticks dynamic
1313 ticks implib
1203 ticks dynamic
1313 ticks implib
1203 ticks dynamic
1312 ticks implib
TestDynStatic164.exe
672 ticks dynamic
609 ticks implib
610 ticks dynamic
625 ticks implib
625 ticks dynamic
640 ticks implib
625 ticks dynamic
594 ticks implib
625 ticks dynamic
625 ticks implib
for TestDynStatic1_workspace1.zip
So basically 6 of one, half a dozen of the other.
Meh ...
Quote from: NoCforMe on April 06, 2025, 08:09:55 AM
So basically 6 of one, half a dozen of the other.
Meh ...
But 12 for 64 bit version :tongue: (almost twice as fast)
I was referring to the difference between "dynamic" and "implib".
Quote from: NoCforMe on April 06, 2025, 08:19:17 AM
I was referring to the difference between "dynamic" and "implib".
:biggrin: I know. :joking:
I thought native calls were supposed to be faster?
I ran my own tests.
NTGetTickCount doesn't exist so we must use NTQueryPerformanceCounter.
Cycles normal 35
Cycles dynamic 425
Cycles syscall 933
35 cycles is too fast to be a system call. Something is wrong.
EXTRN GetProcAddress:PROC
EXTRN GetModuleHandleA:PROC
EXTRN QueryPerformanceCounter:PROC
.data
ModuleName BYTE "Kernel32.dll",0
NTModuleName BYTE "Ntdll.dll",0
ProcName BYTE "QueryPerformanceCounter",0
NTProcName BYTE "NtQueryPerformanceCounter",0
ALIGN 16
Counter QWORD 1337
.code
TestNT proc
push rbx
push rdi
push rsi
sub rsp,32
lea rcx,NTModuleName
call GetModuleHandleA
mov rcx,rax
lea rdx,NTProcName
call GetProcAddress
mov esi,[rax+4]
lfence
mfence
rdtsc ;edx:eax
lfence
shl rdx,32
lea rdi,[rax+rdx]
mov ebx,16384
Start_NT:
mov eax,esi
lea r10,Counter
xor edx,edx
syscall ;(rax),r10,rdx,r8,r9
dec ebx
jnz Start_NT
lfence
mfence
rdtsc
lfence
shl rdx,32
or rax,rdx
sub rax,rdi
shr eax,14
add rsp,32
pop rsi
pop rdi
pop rbx
ret
TestNT endp
TestNative proc
push rsi
push rbx
push rdi
sub rsp,32
lea rcx,ModuleName
call GetModuleHandleA
test rax,rax
jz Error_End
mov rcx,rax
lea rdx,ProcName
call GetProcAddress
test rax,rax
jz Error_End
mov rsi,rax
lfence
mfence
rdtsc ;edx:eax
lfence
shl rdx,32
lea rdi,[rax+rdx]
mov ebx,16384
Start_NA:
lea rcx,Counter
call rsi ;->RAX
dec ebx
jnz Start_NA
lfence
mfence
rdtsc
lfence
shl rdx,32
or rax,rdx
sub rax,rdi
shr eax,14
Error_End:
add rsp,32
pop rdi
pop rbx
pop rsi
ret
TestNative endp
TestNormal proc
push rbx
push rdi
sub rsp,40
lfence
mfence
rdtsc ;edx:eax
lfence
shl rdx,32
lea rdi,[rax+rdx]
mov ebx,16384
Start_N:
lea rcx,Counter
call QueryPerformanceCounter ;->RAX
dec ebx
jnz Start_N
lfence
mfence
rdtsc
lfence
shl rdx,32
or rax,rdx
sub rax,rdi
shr eax,14
add rsp,40
pop rdi
pop rbx
ret
TestNormal endp
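The measuring pattern shared by the three procedures above (read the counter, run 16384 calls, read the counter again, shift the difference right by 14) can be sketched in portable C. The counter here is simulated so that the sketch runs anywhere; `fake_tsc`, `read_counter` and `call_under_test` are hypothetical stand-ins, and on real hardware the counter read would be a fenced `rdtsc` as in the assembly:

```c
#include <stdint.h>

#define ITERATIONS 16384            /* 2^14, as in mov ebx,16384 */

/* Simulated time-stamp counter (hypothetical, for illustration only). */
static uint64_t fake_tsc = 0;
static uint64_t read_counter(void) { return fake_tsc; }

/* Stand-in for the call under test; pretends each call costs 35 cycles. */
static void call_under_test(void) { fake_tsc += 35; }

/* Read the counter, run the loop, read again, divide the delta by 2^14. */
uint64_t average_cycles(void)
{
    uint64_t start = read_counter();
    for (int i = 0; i < ITERATIONS; i++)
        call_under_test();
    uint64_t end = read_counter();
    return (end - start) >> 14;     /* same as shr eax,14 */
}
```

The result is an average that includes the loop overhead, so differences of a cycle or two between variants are within the noise of this method.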
Quote from: InfiniteLoop on April 07, 2025, 12:22:28 PM
35 cycles is too fast to be a system call. Something is wrong.
For results much lower than the others, maybe an issue with your testing methods?
Also, why not attach the executable for your tests, so others can use it. That is sort of standard practice here in the Laboratory for timing/cycle counting tests, afaics.
Just curious, why NTanything?
Ah yes, I made a mistake.
Would you trust an .exe from a random over the internet?
NT is the lowest level, the "native windows API". I assumed it's faster and didn't even need to be tested.
Normal call 36 cycles.
Dynamic call 35 cycles.
NT syscall 252 cycles.
Quote from: InfiniteLoop on April 07, 2025, 03:51:34 PM
Would you trust an .exe from a random over the internet?
From here, from a known member, absolutely. Also, the file can be inspected, if from an unknown source like a brand new member, or an external (meaning not from an attachment on the forum) link.
Quote
NT is the lowest level, the "native windows API".
Ok. I had never bothered with syscalls.
Quote
I assumed its faster and didn't even need to be tested.
Normal call 105 cycles.
Dynamic call 118 cycles.
NT syscall 720 cycles.
Well, now you know. And so do we. :biggrin:
Pelles C project using that code
normal: 3925
dynamic: 2676
syscall: 3647
normal: 2833
dynamic: 3341
syscall: 2899
Quote from: TimoVJL on April 07, 2025, 04:00:16 PM
Pelles C project using that code
normal: 3925
dynamic: 2676
syscall: 3647
normal: 2833
dynamic: 3341
syscall: 2899
first run
normal: 82
dynamic: 46
syscall: 1976
another run
normal: 47
dynamic: 46
syscall: 2085
Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz 3.60 GHz - btw
This Pelles C project tests a normal call to NtQueryPerformanceCounter() too.
No differences ?
C:\Users\Administrator\Downloads\TestDynImpLib_WS2>TestDynImpLib.exe
231 ticks dynamic
231 ticks implib
9842 ticks native
237 ticks dynamic
365 ticks implib
5713 ticks native
231 ticks dynamic
231 ticks implib
5564 ticks native
229 ticks dynamic
230 ticks implib
6366 ticks native
392 ticks dynamic
251 ticks implib
6775 ticks native
C:\Users\Administrator\Downloads\TestDynImpLib_WS2>TestDynImpLib64.exe
125 ticks dynamic
126 ticks implib
4854 ticks native
126 ticks dynamic
139 ticks implib
5131 ticks native
125 ticks dynamic
130 ticks implib
5190 ticks native
126 ticks dynamic
126 ticks implib
5156 ticks native
126 ticks dynamic
126 ticks implib
4873 ticks native
So NtQueryPerformanceCounter() is slow in Windows 10 / 11
My test was in Windows 7
Quote from: TimoVJL on April 07, 2025, 10:46:42 PM
So NtQueryPerformanceCounter() slow in Windows 10 / 11
It would seem so.
Windows 10 here.
Please note that the topic is dynamic vs static linking the same function. Timing two entirely different functions, i.e. ntdll NtQueryPerformanceCounter vs kernel32 QueryPerformanceCounter makes absolutely no sense in this context.
Quote from: jj2007 on April 08, 2025, 12:32:18 AM
Please note that the topic is dynamic vs static linking the same function. Timing two entirely different functions, i.e. ntdll NtQueryPerformanceCounter vs kernel32 QueryPerformanceCounter makes absolutely no sense in this context.
That function is a kernel function that does QueryPerformanceCounter() and QueryPerformanceFrequency() in one go.
Calling it from user level might slow it down.
Quote from: TimoVJL on April 08, 2025, 12:37:02 AM
Calling it from user level might slow it down.
In fact, both eventually call QueryPerformanceCounter, but the ntdll version goes through a Wow64Transition call:
/* ll1, ll2, ll3 are assumed to be 64-bit integers, hence %lld */
for (int n=0; n<5; n++) {
    pQueryPerformanceCounter((LARGE_INTEGER*)&ll1);
    _asm int 3; // jmp near [<&api-ms-win-core-profile-l1-1-0.QueryPerformanceCounter>]
    for (i=0; i< 1000; i++)
        pQueryPerformanceCounter((LARGE_INTEGER*)&ll2);
    printf("%lld ticks dynamic\n", (long long)(ll2-ll1));
    QueryPerformanceCounter((LARGE_INTEGER*)&ll1);
    for (i=0; i< 1000; i++)
        QueryPerformanceCounter((LARGE_INTEGER*)&ll2);
    printf("%lld ticks implib\n", (long long)(ll2-ll1));
    pNTQueryPerformanceCounter((LARGE_INTEGER*)&ll1, (LARGE_INTEGER*)&ll3);
    _asm int 3;
    // jmp near [Wow64Transition], followed by jmp far 0033:77787009
    // jmp near [<&api-ms-win-core-profile-l1-1-0.QueryPerformanceCounter>]
    for (i=0; i< 1000; i++)
        pNTQueryPerformanceCounter((LARGE_INTEGER*)&ll2, (LARGE_INTEGER*)&ll3);
    printf("%lld ticks native\n", (long long)(ll2-ll1));
    printf("\n");
}
Thanks for testing that mystery.
pNTQueryPerformanceCounter((LARGE_INTEGER*)&ll1, NULL);
didn't change speed at all in Windows 7