News:

Masm32 SDK description, downloads and other helpful links
Message to All Guests

Main Menu

CreateThread overhead

Started by jj2007, January 31, 2021, 12:56:21 PM

Previous topic - Next topic

hutch--


Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz
113 µs  for #1/1
147 µs  for #2/2
324 µs  for #3/3
348 µs  for #4/4
372 µs  for #5/5
410 µs  for #6/6
433 µs  for #7/7
449 µs  for #8/8
473 µs  for #9/9
489 µs  for #10/10
513 µs  for #11/11
530 µs  for #12/12
574 µs  for #13/22
605 µs  for #14/15
654 µs  for #15/16
678 µs  for #16/17
692 µs  for #17/26
715 µs  for #18/18
742 µs  for #19/28
780 µs  for #20/20
823 µs  for #21/21
870 µs  for #22/13
895 µs  for #23/14
923 µs  for #24/35
942 µs  for #25/24
977 µs  for #26/25
997 µs  for #27/38
1046 µs for #28/19
1085 µs for #29/29
1107 µs for #30/40
1145 µs for #31/30
1177 µs for #32/32
1202 µs for #33/33
1227 µs for #34/34
1269 µs for #35/36
1314 µs for #36/37
1352 µs for #37/27
1393 µs for #38/39
1427 µs for #39/31
1470 µs for #40/23

quarantined

Intel(R) Core(TM)2 Duo CPU     E8400  @ 3.00GHz
80 µs   for #1/1
116 µs   for #2/2
159 µs   for #3/3
245 µs   for #4/5
274 µs   for #5/4
341 µs   for #6/7
355 µs   for #7/6
387 µs   for #8/9
430 µs   for #9/8
480 µs   for #10/11
526 µs   for #11/13
553 µs   for #12/10
576 µs   for #13/12
613 µs   for #14/15
664 µs   for #15/17
714 µs   for #16/19
744 µs   for #17/14
773 µs   for #18/16
789 µs   for #19/18
833 µs   for #20/21
895 µs   for #21/23
937 µs   for #22/25
978 µs   for #23/27
1016 µs   for #24/29
1054 µs   for #25/31
1101 µs   for #26/33
1129 µs   for #27/20
1156 µs   for #28/22
1173 µs   for #29/24
1216 µs   for #30/35
1272 µs   for #31/39
1279 µs   for #32/28
1296 µs   for #33/30
1321 µs   for #34/26
1374 µs   for #35/37
1392 µs   for #37/32
1381 µs   for #36/34
1485 µs   for #40/40
1419 µs   for #38/36
1445 µs   for #39/38
:tongue:

win 7. 32 bit

quarantined

Quote from: jj2007 on April 20, 2021, 07:20:50 AM
That's pretty fast. I wonder whether the WOW64 overhead plays a role here.

For comparison from win 10, 64 bit...

215 µs   for #1/2
303 µs   for #2/1
516 µs   for #3/4
544 µs   for #4/5
567 µs   for #5/6
484 µs   for #6/3
750 µs   for #7/8
758 µs   for #8/7
901 µs   for #9/10
874 µs   for #10/9
1104 µs   for #11/11
1132 µs   for #12/12
1154 µs   for #13/13
1190 µs   for #14/14
1348 µs   for #15/15
1476 µs   for #16/16
1499 µs   for #17/17
1672 µs   for #18/19
1645 µs   for #19/18
1875 µs   for #20/20
1902 µs   for #21/21
1929 µs   for #22/22
2016 µs   for #23/23
2142 µs   for #24/24
2173 µs   for #25/25
2196 µs   for #26/26
2332 µs   for #27/27
2483 µs   for #28/28
2510 µs   for #29/29
2533 µs   for #30/30
2714 µs   for #31/32
2768 µs   for #32/31
2922 µs   for #33/33
2949 µs   for #34/34
2972 µs   for #35/35
2999 µs   for #36/36
3217 µs   for #37/38
3254 µs   for #38/37
3338 µs   for #39/39
3366 µs   for #40/40

FORTRANS

Hi Jochen,

   Both programs, three computers.  Two XP {assumed 32-bit},
one 8.1.  Just noticed when posting, one negative timing.  So,
I hope you find this either useful, or amusing.

Intel(R) Pentium(R) M processor 1.70GHz
469 µs for #1/1
529 µs for #2/2
561 µs for #3/3
592 µs for #4/4
622 µs for #5/5
651 µs for #6/6
682 µs for #7/7
712 µs for #8/8
742 µs for #9/9
772 µs for #10/10
803 µs for #11/11
833 µs for #12/12
863 µs for #13/13
893 µs for #14/14
923 µs for #15/15
953 µs for #16/16
983 µs for #17/17
1014 µs for #18/18
1538 µs for #19/19
1571 µs for #20/20
1602 µs for #21/21
1632 µs for #22/22
1772 µs for #23/23
1804 µs for #24/24
1835 µs for #25/25
1865 µs for #26/26
1895 µs for #27/27
2001 µs for #28/28
2032 µs for #29/29
2065 µs for #30/30
2096 µs for #31/31
2126 µs for #32/32
2156 µs for #33/33
2186 µs for #34/34
2216 µs for #35/35
2345 µs for #36/36
2376 µs for #37/37
2405 µs for #38/38
2434 µs for #39/39
2464 µs for #40/40

Intel(R) Pentium(R) M processor 1.70GHz
42 µs for creating the thread
thread resumed after 74 µs for #1
thread resumed after 5 µs for #2
thread resumed after 4 µs for #3
thread resumed after 3 µs for #4
thread resumed after 3 µs for #5
thread resumed after 3 µs for #6
thread resumed after 3 µs for #7
thread resumed after 3 µs for #8
thread resumed after 3 µs for #9
thread resumed after 3 µs for #10
thread resumed after 3 µs for #11
thread resumed after 3 µs for #12
thread resumed after 3 µs for #13
thread resumed after 3 µs for #14
thread resumed after 3 µs for #15
thread resumed after 3 µs for #16
thread resumed after 3 µs for #17
thread resumed after 3 µs for #18
thread resumed after 3 µs for #19
thread resumed after 3 µs for #20
thread resumed after 3 µs for #21
thread resumed after 3 µs for #22
thread resumed after 3 µs for #23
thread resumed after 3 µs for #24
thread resumed after 3 µs for #25
thread resumed after 3 µs for #26
thread resumed after 3 µs for #27
thread resumed after 3 µs for #28
thread resumed after 3 µs for #29
thread resumed after 3 µs for #30
thread resumed after 3 µs for #31
--- hit any key ---

Intel(R) Core(TM) i3-4005U CPU @ 1.70GHz
264 µs for #1/1
357 µs for #2/2
383 µs for #3/3
510 µs for #4/5
609 µs for #5/6
672 µs for #6/4
703 µs for #7/8
745 µs for #8/7
767 µs for #9/9
1128 µs for #10/10
1188 µs for #11/11
1241 µs for #12/12
1284 µs for #13/13
1337 µs for #14/14
1397 µs for #15/15
1504 µs for #16/16
1669 µs for #17/17
1769 µs for #18/18
1826 µs for #19/19
1951 µs for #20/20
1988 µs for #21/21
2036 µs for #22/22
2190 µs for #23/23
2230 µs for #24/24
2277 µs for #25/25
2343 µs for #26/26
2384 µs for #27/27
2585 µs for #28/28
2626 µs for #29/29
2673 µs for #30/30
2728 µs for #31/31
2777 µs for #32/32
2960 µs for #33/34
3066 µs for #34/33
3147 µs for #35/35
3185 µs for #36/37
3250 µs for #37/36
3313 µs for #38/38
3386 µs for #39/39
3511 µs for #40/40

Intel(R) Core(TM) i3-4005U CPU @ 1.70GHz
61 µs for creating the thread
thread resumed after 252 µs for #1
thread resumed after 42 µs for #2
thread resumed after 7 µs for #3
thread resumed after 31 µs for #4
thread resumed after 18 µs for #5
thread resumed after 23 µs for #6
thread resumed after 28 µs for #7
thread resumed after 49 µs for #8
thread resumed after 19 µs for #9
thread resumed after 40 µs for #10
thread resumed after 31 µs for #11
thread resumed after 59 µs for #12
thread resumed after 43 µs for #13
thread resumed after 40 µs for #14
thread resumed after 30 µs for #15
thread resumed after 32 µs for #16
thread resumed after 33 µs for #17
thread resumed after 32 µs for #18
thread resumed after 31 µs for #19
thread resumed after 35 µs for #20
thread resumed after 46 µs for #21
thread resumed after 35 µs for #22
thread resumed after 39 µs for #23
thread resumed after 20 µs for #24
thread resumed after 37 µs for #25
thread resumed after 27 µs for #26
thread resumed after 45 µs for #27
thread resumed after 29 µs for #28
thread resumed after 48 µs for #29
thread resumed after 28 µs for #30
thread resumed after 35 µs for #31
--- hit any key ---

Genuine Intel(R) CPU           T2400  @ 1.83GHz
237 µs for #1/1
369 µs for #2/2
466 µs for #3/3
545 µs for #4/4
634 µs for #5/5
758 µs for #6/7
877 µs for #7/6
991 µs for #8/8
1010 µs for #9/9
1119 µs for #10/10
1144 µs for #11/11
1305 µs for #12/12
1381 µs for #13/13
1494 µs for #14/14
1609 µs for #15/15
1703 µs for #16/16
1798 µs for #17/17
1919 µs for #18/18
1944 µs for #19/19
2102 µs for #20/20
2542 µs for #21/21
2658 µs for #22/23
2693 µs for #23/27
2667 µs for #24/25
2856 µs for #25/24
2875 µs for #26/22
2988 µs for #27/29
3105 µs for #28/26
-2147483648 µs for #29/28
3238 µs for #30/31
3321 µs for #31/30
3429 µs for #32/32
3459 µs for #33/33
3609 µs for #35/35
3609 µs for #35/35
3722 µs for #36/36
3874 µs for #37/37
3963 µs for #38/38
4048 µs for #39/39
4137 µs for #40/40

Genuine Intel(R) CPU           T2400  @ 1.83GHz
118 µs for creating the thread
thread resumed after 177 µs for #1
thread resumed after 13 µs for #2
thread resumed after 10 µs for #3
thread resumed after 17 µs for #4
thread resumed after 11 µs for #5
thread resumed after 16 µs for #6
thread resumed after 9 µs for #7
thread resumed after 9 µs for #8
thread resumed after 8 µs for #9
thread resumed after 14 µs for #10
thread resumed after 11 µs for #11
thread resumed after 13 µs for #12
thread resumed after 11 µs for #13
thread resumed after 61 µs for #14
thread resumed after 10 µs for #15
thread resumed after 15 µs for #16
thread resumed after 17 µs for #17
thread resumed after 15 µs for #18
thread resumed after 8 µs for #19
thread resumed after 15 µs for #20
thread resumed after 11 µs for #21
thread resumed after 9 µs for #22
thread resumed after 11 µs for #23
thread resumed after 17 µs for #24
thread resumed after 9 µs for #25
thread resumed after 12 µs for #26
thread resumed after 11 µs for #27
thread resumed after 16 µs for #28
thread resumed after 10 µs for #29
thread resumed after 8 µs for #30
thread resumed after 10 µs for #31
--- hit any key ---


Cheers,

Steve N.

jj2007

Thanks, Steve. The first one is definitely fast ("thread resumed after 3 µs"), can you verify if it's a 32-bit OS?
If so, it would give us a hint how much the translation of 32-bit calls to the 64-bit OS costs in practice.

Attached a test with a fast WinAPI call: GetTickCount. Here is the source:
include \masm32\MasmBasic\MasmBasic.inc
  Init
  PrintCpu 0
  Print Str$("This is Windows version %i", MbWinVersion()), Str$(".%i", ecx)
  Print Str$(", build %i, ", MbWinVersion(build))
  .if Win64()
PrintLine "64-bit OS"
  .else
PrintLine "32-bit OS"
  .endif
  xor ecx, ecx
  .Repeat
inc ecx ; warmup
  .Until Sign?
  REPEAT 3
NanoTimer()
xor ebx, ebx
.Repeat
invoke GetTickCount
inc ebx
.Until ebx>99999999
  Print NanoTimer$(), Str$(" for %i Million times GetTickCount\n", ebx/1000000)
  ENDM
  Inkey
EndOfCode


My results:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz
This is Windows version 6.1, build 7601, 64-bit OS
584 ms for 100 Million times GetTickCount
533 ms for 100 Million times GetTickCount
539 ms for 100 Million times GetTickCount

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz
This is Windows version 5.1, build 2600, 32-bit OS
321 ms for 100 Million times GetTickCount
279 ms for 100 Million times GetTickCount
283 ms for 100 Million times GetTickCount


The second one is a VM :cool:

For comparison, the empty loop:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz
This is Windows version 6.1, build 7601, 64-bit OS
35 ms for 100 Million times empty loop
35 ms for 100 Million times empty loop
42 ms for 100 Million times empty loop

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz
This is Windows version 5.1, build 2600, 32-bit OS
34 ms for 100 Million times empty loop
35 ms for 100 Million times empty loop
34 ms for 100 Million times empty loop


So it seems that the translation of a WinAPI call from 32-bit to 64-bit land costs around 2.5 milliseconds per Million calls :cool:

FORTRANS

Hi,

Quote from: jj2007 on April 22, 2021, 10:20:04 PM
Thanks, Steve. The first one is definitely fast ("thread resumed after 3 µs"), can you verify if it's a 32-bit OS?

   Well, it's a 32-bit CPU, but anyway, here's your program's
output and that of VER.

This is Windows version 5.1, build 2600, 32-bit OS
603 ms for 100 Million times GetTickCount
538 ms for 100 Million times GetTickCount
601 ms for 100 Million times GetTickCount

Microsoft Windows XP [Version 5.1.2600]


Regards,

Steve

jj2007

Thanks :biggrin:

Here is the 64-bit version:
include \Masm32\MasmBasic\Res\JBasic.inc ; ## console demo, builds in 32- or 64-bit mode with UAsm, ML, AsmC ##
ticks QWORD ?
Init ; OPT_64 1 ; put 0 for 32 bit, 1 for 64 bit assembly
  PrintLine Chr$("This program was assembled with ", @AsmUsed$(1), " in ", jbit$, "-bit format.")
  xor edi, edi
@OuterLoop:
jinvoke GetTickCount
mov ticks, rax
xor ebx, ebx
@@: jinvoke GetTickCount
inc ebx
cmp ebx, 99999999
jb @B
sub rax, ticks
PrintLine Str$("%i ms for 100 Million x GetTickCount", rax)
  inc edi
  cmp edi, 3
  jb @OuterLoop
  Inkey
EndOfCode

This program was assembled with ml64 in 64-bit format.
343 ms for 100 Million x GetTickCount
328 ms for 100 Million x GetTickCount
327 ms for 100 Million x GetTickCount


A bit slow compared to the 32-bit code in the 32-bit OS, but the 2.5ms/Million overhead is not present :cool:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz
This is Windows version 5.1, build 2600, 32-bit OS
321 ms for 100 Million times GetTickCount
279 ms for 100 Million times GetTickCount
283 ms for 100 Million times GetTickCount

quarantined

---------------  For 'GetTickCount.exe' (32 bit version)

64 bit cpu Windows XP 32 bit
Intel(R) Core(TM)2 Duo CPU     E8400  @ 3.00GHz
This is Windows version 5.1, build 2600, 32-bit OS
300 ms for 100 Million times GetTickCount
267 ms for 100 Million times GetTickCount
267 ms for 100 Million times GetTickCount

64 bit cpu Windows 7 32 bit
Intel(R) Core(TM)2 Duo CPU     E8400  @ 3.00GHz
This is Windows version 6.1, build 7601, 32-bit OS
478 ms for 100 Million times GetTickCount
435 ms for 100 Million times GetTickCount
437 ms for 100 Million times GetTickCount




--------------------------------------------
64 bit cpu Windows 10 64 bit
Intel(R) Core(TM)2 Duo CPU     E8400  @ 3.00GHz
This is Windows version 10.0, build 19042, 64-bit OS
470 ms for 100 Million times GetTickCount
435 ms for 100 Million times GetTickCount
434 ms for 100 Million times GetTickCount

=============================================
---------------  For 'GetTickCount64.exe' (64 bit version)

64 bit cpu Windows 10 64 bit
This program was assembled with ml64 in 64-bit format.
203 ms for 100 Million x GetTickCount
203 ms for 100 Million x GetTickCount
203 ms for 100 Million x GetTickCount

Sorry I don't have a native 32 bit box to test on.


jj2007


LiaoMi

Quote from: jj2007 on April 23, 2021, 12:53:05 AM
Thanks :biggrin:

Here is the 64-bit version:
include \Masm32\MasmBasic\Res\JBasic.inc ; ## console demo, builds in 32- or 64-bit mode with UAsm, ML, AsmC ##
ticks QWORD ?
Init ; OPT_64 1 ; put 0 for 32 bit, 1 for 64 bit assembly
  PrintLine Chr$("This program was assembled with ", @AsmUsed$(1), " in ", jbit$, "-bit format.")
  xor edi, edi
@OuterLoop:
jinvoke GetTickCount
mov ticks, rax
xor ebx, ebx
@@: jinvoke GetTickCount
inc ebx
cmp ebx, 99999999
jb @B
sub rax, ticks
PrintLine Str$("%i ms for 100 Million x GetTickCount", rax)
  inc edi
  cmp edi, 3
  jb @OuterLoop
  Inkey
EndOfCode

This program was assembled with ml64 in 64-bit format.
343 ms for 100 Million x GetTickCount
328 ms for 100 Million x GetTickCount
327 ms for 100 Million x GetTickCount


A bit slow compared to the 32-bit code in the 32-bit OS, but the 2.5ms/Million overhead is not present :cool:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz
This is Windows version 5.1, build 2600, 32-bit OS
321 ms for 100 Million times GetTickCount
279 ms for 100 Million times GetTickCount
283 ms for 100 Million times GetTickCount


win10 x64 i7-4810mq

This program was assembled with ml64 in 64-bit format.
203 ms for 100 Million x GetTickCount
188 ms for 100 Million x GetTickCount
203 ms for 100 Million x GetTickCount

hutch--

Haswell E/EP @ 4gig.


This program was assembled with ml64 in 64-bit format.
125 ms for 100 Million x GetTickCount
125 ms for 100 Million x GetTickCount
125 ms for 100 Million x GetTickCount

hutch--

Bothered to have a look, you certainly have some strange code for 64 bit. Looks suspiciously like hybrid STDCALL 32 bit code embedded in 64 bit.

0x140001248:
stosq qword ptr [rdi], rax         ; no REP

0x14000119b:
push rsi
push rdi
push rbx
push rdx
mov rsi, rcx
mov rdi, rdx
xor rbx, rbx

pop rdx
pop rbx
pop rdi
pop rsi
ret

jdebP:
mov qword ptr [0x140002700], rsp
lea rsp, [0x140002790]
push r15
push r14
push r13
push r12
push r11
push r10
push r9
push r8
push rdi
push rsi
push rbp
push rsp
push rbx
push rdx
push rcx
push rax
lea rdx, [0x140002700]
mov rax, qword ptr [rsp-0x10]
mov rdx, qword ptr [rax]
sub rdx, 5
mov qword ptr [rsp-8], rdx
mov qword ptr [rsp+0x20], rax
add qword ptr [rsp+0x20], 8
xchg rax, rsp                      ; antique junk
or eax, 0xffffffff
cdq                                ; more antique junk
lea rcx, [0x140002790]
fxsave ptr [rcx]
ret

jj2007

Quote from: hutch-- on April 23, 2021, 08:15:08 AM
Bothered to have a look, you certainly have some strange code for 64 bit. Looks suspiciously like hybrid STDCALL 32 bit code embedded in 64 bit.

Any violations of the x64 ABI? Shall we call the code police?
:tongue:

hutch--

> Any violations of the x64 ABI

Stack twiddling again, dodgy unreliable code.

jj2007

Is that just your opinion, or can you prove it, maybe with a crispy example of crashing code?