Author Topic: The joy of beating the CRT by a factor 10  (Read 1607 times)

jj2007

  • Member
  • *****
  • Posts: 13336
  • Assembly is fun ;-)
    • MasmBasic
The joy of beating the CRT by a factor 10
« on: May 05, 2022, 04:45:16 AM »
Code: [Select]
**** generating 10000 random numbers & writing them to a buffer ****
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
++-++++++11 of 20 tests valid,
11906   kCycles for 10 * random Str$()  MasmBasic
18391   kCycles for 10 * random dwtoa   Masm32 SDK
16720   kCycles for 10 * random Str$()  MasmBasic with saving
23448   kCycles for 10 * random dwtoa   Masm32 SDK with saving
139015  kCycles for 10 * random sprintf CRT
156212  kCycles for 10 * random sprintf CRT with saving

11292   kCycles for 10 * random Str$()  MasmBasic
18338   kCycles for 10 * random dwtoa   Masm32 SDK
16328   kCycles for 10 * random Str$()  MasmBasic with saving
29636   kCycles for 10 * random dwtoa   Masm32 SDK with saving
139799  kCycles for 10 * random sprintf CRT
145937  kCycles for 10 * random sprintf CRT with saving

13457   kCycles for 10 * random Str$()  MasmBasic
24111   kCycles for 10 * random dwtoa   Masm32 SDK
17040   kCycles for 10 * random Str$()  MasmBasic with saving
23754   kCycles for 10 * random dwtoa   Masm32 SDK with saving
139204  kCycles for 10 * random sprintf CRT
121430  kCycles for 10 * random sprintf CRT with saving

101     bytes for random Str$()  MasmBasic
113     bytes for random dwtoa   Masm32 SDK
89      bytes for random Str$()  MasmBasic with saving
90      bytes for random dwtoa   Masm32 SDK with saving
121     bytes for random sprintf CRT
89      bytes for random sprintf CRT with saving

edi points to a buffer that gets filled with strings of the format n<tab>random number<CrLf>:
Code: [Select]
0 220383915
1 771014011
2 2113869234
3 901510269
4 1077232086
5 1507169316
etc

Code: [Select]
MakeStringsMB proc uses edi
  xor ecx, ecx
  .Repeat
    Str$(ecx, dest:edi) ; write #number to edi
    mov edi, edx ; edx points to the end of the string
    mov al, 9 ; tab
    stosb
    Str$(Rand(7fffffffh), dest:edi) ; write #random to edi
    mov edi, edx
    mov ax, 0A0Dh ; CrLf (stored as 0Dh, 0Ah)
    stosw
    inc ecx
  .Until ecx>numstrings
  xor eax, eax
  stosb ; zero terminator
  ret
MakeStringsMB endp
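
The MasmBasic version never measures the string it has just written, because the conversion hands back the end of the string in edx. For readers who think in C, the same idea can be sketched roughly as below. The names u32_to_dec and append_line are hypothetical helpers made up for this illustration; this is not how Str$() is implemented, only a picture of the "return the end pointer" pattern.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not MasmBasic's Str$): writes the decimal digits of
   value at dst, appends a terminating zero, and returns a pointer to that
   zero, i.e. the end of the string. */
char *u32_to_dec(uint32_t value, char *dst)
{
    char tmp[10];               /* a 32-bit value has at most 10 digits */
    size_t n = 0;

    assert(dst != NULL);
    do {                        /* collect digits, least significant first */
        tmp[n++] = (char)('0' + value % 10u);
        value /= 10u;
    } while (value != 0);
    while (n > 0)               /* copy them out in reading order */
        *dst++ = tmp[--n];
    *dst = '\0';
    return dst;
}

/* Appends "index<tab>number<CR><LF>" at dst and returns the new end.
   The caller must provide a buffer that is large enough. */
char *append_line(char *dst, uint32_t index, uint32_t number)
{
    dst = u32_to_dec(index, dst);
    *dst++ = '\t';
    dst = u32_to_dec(number, dst);
    *dst++ = '\r';
    *dst++ = '\n';
    *dst = '\0';
    return dst;
}
```

Calling append_line in a loop and feeding each returned pointer into the next call fills the buffer without a single length scan.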
Code: [Select]
MakeStringsM32 proc uses edi ebx
  xor ebx, ebx
  .Repeat
    invoke dwtoa, ebx, edi ; pretty fast Masm32 library algo
    add edi, len(edi) ; dwtoa doesn't tell us how many bytes were copied...
    mov al, 9
    stosb
    invoke dwtoa, Rand(7fffffffh), edi ; use MasmBasic Rand() to make sure the choice of PRNG does not influence timings
    add edi, len(edi)
    mov ax, 0A0Dh
    stosw
    inc ebx
  .Until ebx>numstrings
  xor eax, eax
  stosb
  ret
MakeStringsM32 endp
Code: [Select]
MakeStringsCRT proc uses edi ebx
  xor ebx, ebx
  .Repeat
    invoke crt_sprintf, edi, chr$("%i", 9), ebx ; slow CRT
    add edi, rv(lstrlen, edi) ; sprintf returns the character count in eax, but this test measures the length again with lstrlen
    invoke crt_sprintf, edi, chr$("%i", 13, 10), Rand(7fffffffh)
    add edi, rv(lstrlen, edi)
    inc ebx
  .Until ebx>numstrings
  xor eax, eax
  stosb
  ret
MakeStringsCRT endp
« Last Edit: May 05, 2022, 05:45:43 AM by jj2007 »
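
A side note on the CRT variant: MakeStringsCRT calls lstrlen after each sprintf to find the end of the string. Standard C sprintf returns the number of characters written, not counting the terminating zero, so the pointer can be advanced without a second pass over the string. The sketch below shows that in C; append_line_sprintf is a hypothetical name chosen for this example. How much of the measured gap that second pass accounts for is not something this sketch can tell; that would need a new timing run.

```c
#include <assert.h>
#include <stdio.h>

/* Appends "index<tab>number<CR><LF>" at dst, using the return value of
   sprintf to advance the pointer. Assumes the buffer is large enough. */
char *append_line_sprintf(char *dst, int index, int number)
{
    int written;

    assert(dst != NULL);
    written = sprintf(dst, "%i\t", index);    /* count excludes the zero */
    assert(written > 0);
    dst += written;
    written = sprintf(dst, "%i\r\n", number);
    assert(written > 0);
    dst += written;
    return dst;                               /* points at the terminator */
}
```

In the 32-bit assembly version the count comes back in eax, so an add edi, eax right after the call should do the same job as the lstrlen line.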

HSE

  • Member
  • *****
  • Posts: 2263
  • AMD 7-32 / i3 10-64
Re: The joy of beating the CRT by a factor 10
« Reply #1 on: May 05, 2022, 11:31:15 AM »
Hi JJ !!

What happens if you put the MasmBasic routine in a DLL?

Regards, HSE.
Equations in Assembly: SmplMath

jj2007

« Reply #2 on: May 05, 2022, 06:16:47 PM »
What happens if you put the MasmBasic routine in a DLL?

Hi Hector,

There is an even better solution: the static library in \Masm32\MasmBasic\MasmBasic.lib :thumbsup:

HSE

« Reply #3 on: May 05, 2022, 08:48:09 PM »
 :biggrin: :biggrin: :biggrin: I'm thinking of fair timings! The CRT functions are in a DLL.

jj2007

« Reply #4 on: May 05, 2022, 08:51:04 PM »
:biggrin: :biggrin: :biggrin: I'm thinking of fair timings! The CRT functions are in a DLL.

So what? The CRT DLL resides in the address space of your process. You may have two cycles more for an additional jmp, but that's all.
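
The "additional jmp" argument can be pictured in C as a call through a pointer slot instead of a call to a fixed address. The sketch below is an illustration only and measures nothing: import_slot_strlen is a hypothetical stand-in for an import table entry, and an ordinary function pointer takes the place of whatever the Windows loader does when it resolves the DLL.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one import table slot: the loader would store
   the address of the DLL function here when the program starts. */
typedef size_t (*strlen_fn)(const char *);

strlen_fn import_slot_strlen = strlen;

/* Plain call as written in the source; the compiler and linker decide
   how the target address is encoded. */
size_t length_direct(const char *s)
{
    assert(s != NULL);
    return strlen(s);
}

/* Indirect call: the target address is read from the slot first, which
   is the extra memory access an imported function costs per call. */
size_t length_via_slot(const char *s)
{
    assert(s != NULL);
    return import_slot_strlen(s);
}
```

Both functions return the same result; the only difference is where the call target comes from.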

HSE

« Reply #5 on: May 05, 2022, 09:03:10 PM »
So what? The CRT DLL resides in the address space of your process. You may have two cycles more for an additional jmp, but that's all.
Yes, everybody says that. I don't know. I don't remember those timings.

hutch--

  • Administrator
  • Member
  • ******
  • Posts: 10031
  • Mnemonic Driven API Grinder
    • The MASM32 SDK
« Reply #6 on: May 05, 2022, 09:09:45 PM »
A DLL proc is never as fast as a local proc. You can test this by using an identical procedure locally and in a DLL.
hutch at movsd dot com
http://www.masm32.com    :biggrin:  :skrewy:

HSE

« Reply #7 on: May 05, 2022, 09:38:36 PM »
A DLL proc is never as fast as a local proc. You can test this by using an identical procedure locally and in a DLL.

 :thumbsup:

jj2007

« Reply #8 on: May 05, 2022, 10:54:26 PM »
A DLL proc is never as fast as a local proc. You can test this by using an identical procedure locally and in a DLL.

Oompf.... - you are right, Hutch. But why is that so? Both routines are being called in exactly the same way... :rolleyes:

Code: [Select]
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

6852    cycles for 100 * CRT strlen
5609    cycles for 100 * CRT strlen local
2017    cycles for 100 * MasmBasic Len
5397    cycles for 100 * Masm32 StrLen
8678    cycles for 100 * Masm32 len

6906    cycles for 100 * CRT strlen
5571    cycles for 100 * CRT strlen local
2012    cycles for 100 * MasmBasic Len
5396    cycles for 100 * Masm32 StrLen
8728    cycles for 100 * Masm32 len

6853    cycles for 100 * CRT strlen
5571    cycles for 100 * CRT strlen local
2021    cycles for 100 * MasmBasic Len
5405    cycles for 100 * Masm32 StrLen
8692    cycles for 100 * Masm32 len

6847    cycles for 100 * CRT strlen
5613    cycles for 100 * CRT strlen local
2009    cycles for 100 * MasmBasic Len
5397    cycles for 100 * Masm32 StrLen
8687    cycles for 100 * Masm32 len

14      bytes for CRT strlen
137     bytes for CRT strlen local
10      bytes for MasmBasic Len
10      bytes for Masm32 StrLen
10      bytes for Masm32 len

100     = eax CRT strlen
100     = eax CRT strlen local
100     = eax MasmBasic Len
100     = eax Masm32 StrLen
100     = eax Masm32 len

hutch--

« Reply #9 on: May 05, 2022, 11:26:55 PM »
I guess it has to do with the load address and calling method of a DLL function. If it was a larger function, the call overhead would matter less and less but in practice, that was the result I found.

jj2007

« Reply #10 on: May 05, 2022, 11:30:56 PM »
I see the result but find it difficult to believe :sad:

What happens between the call near msvcrt.strlen and the arrival at 76C443D3? Olly says nothing happens in between :cool:

Code: [Select]
00401040  /$  BB 63000000   mov ebx, 63
00401045  |.  8D49 00       lea ecx, [ecx]
00401048  |>  CC            /int3
00401049  |.  68 2E804000   |push offset 0040802E                    ; /string = "Hello, this is a simple string intended for testing string algos. It has 100 characters without zero"
0040104E  |.  FF15 AC8F4000 |call near [<&msvcrt.strlen>]            ; \MSVCRT.strlen
00401054  |.  83C4 04       |add esp, 4
00401057  |.  4B            |dec ebx
00401058  |.^ 79 EE         \jns short 00401048
0040105A  \.  C3            retn
...
76C443D2  \.  C3            retn
76C443D3  /$  8B4C24 04     mov ecx, [string]                        ; ASCII "Hello, this is a simple string intended for testing string algos. It has 100 characters without zero"
76C443D7  |.  F7C1 03000000 test ecx, 00000003
76C443DD  |.  74 1A         jz short 76C443F9
76C443DF  |>  8A01          /mov al, [ecx]
76C443E1  |.  83C1 01       |add ecx, 1
76C443E4  |.  84C0          |test al, al
76C443E6  |.  74 4C         |jz short 76C44434
76C443E8  |.  F7C1 03000000 |test ecx, 00000003
76C443EE  |.^ 75 EF         \jnz short 76C443DF
76C443F0  |.  83C0 00       add eax, 0
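
The listing stops right after the alignment prologue: the routine checks test ecx, 00000003 and walks byte by byte until the pointer is 4-byte aligned. What follows is not shown, so the main loop in the C sketch below is an assumption: it uses the widely known "does this dword contain a zero byte" test rather than anything taken from msvcrt. The name strlen_by_dwords is made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch of an aligned, four-bytes-per-step strlen. This is a common
   technique and an assumption here, not the actual msvcrt source. */
size_t strlen_by_dwords(const char *s)
{
    const char *p = s;
    uint32_t v;

    assert(s != NULL);
    /* Prologue: go byte by byte until p is 4-byte aligned
       (compare test ecx, 00000003 in the listing). */
    while (((uintptr_t)p & 3u) != 0) {
        if (*p == '\0')
            return (size_t)(p - s);
        p++;
    }
    /* Main loop: load an aligned dword and test all four bytes at once. */
    for (;;) {
        memcpy(&v, p, sizeof v);
        if (((v - 0x01010101u) & ~v & 0x80808080u) != 0)
            break;              /* at least one byte of v is zero */
        p += 4;
    }
    /* Epilogue: locate the zero byte inside the last dword. */
    while (*p != '\0')
        p++;
    return (size_t)(p - s);
}
```

One caveat: the dword load may read up to three bytes past the terminator. An aligned 4-byte read cannot cross a page boundary, which is why library code gets away with it, but strict C treats it as out of bounds, so the sketch should only be fed buffers with a few bytes of slack.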

hutch--

« Reply #11 on: May 06, 2022, 12:29:03 AM »
I doubt that a disassembly will show you what is happening. The mechanism of a DLL is more complex than that of an executable: preferred load address, calling mechanism, and it's all OS based, not simple mnemonics. It probably means you are only beating it by 9.9^

jj2007

« Reply #12 on: May 06, 2022, 06:13:30 AM »
I made another test:

Code: [Select]
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

7054    cycles for 100 * CRT strlen
5627    cycles for 100 * CRT strlen local
383     cycles for 100 * DoNothing local
379     cycles for 100 * DoNothing DLL

6854    cycles for 100 * CRT strlen
5587    cycles for 100 * CRT strlen local
382     cycles for 100 * DoNothing local
379     cycles for 100 * DoNothing DLL

6856    cycles for 100 * CRT strlen
5608    cycles for 100 * CRT strlen local
386     cycles for 100 * DoNothing local
386     cycles for 100 * DoNothing DLL

6876    cycles for 100 * CRT strlen
5573    cycles for 100 * CRT strlen local
382     cycles for 100 * DoNothing local
379     cycles for 100 * DoNothing DLL

There is something special about the CRT: The ratio strlen DLL : strlen local is 1.23 :cool:

DLL attached - see for yourself.
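
For anyone who wants to check the ratio, here is the arithmetic as a small C sketch. The cycle counts are copied from the four runs in this post; the function name cycle_ratio and the array names are made up for the example, and summing the four runs before dividing is just one reasonable way to average them.

```c
#include <assert.h>
#include <stddef.h>

/* Ratio of the summed cycle counts of two test series of equal length. */
double cycle_ratio(const unsigned *a, const unsigned *b, size_t count)
{
    unsigned long sum_a = 0, sum_b = 0;
    size_t i;

    assert(a != NULL && b != NULL && count > 0);
    for (i = 0; i < count; i++) {
        sum_a += a[i];
        sum_b += b[i];
    }
    assert(sum_b != 0);
    return (double)sum_a / (double)sum_b;
}

/* Cycle counts copied from the four runs posted in this reply. */
const unsigned crt_strlen_dll[4]   = { 7054, 6854, 6856, 6876 };
const unsigned crt_strlen_local[4] = { 5627, 5587, 5608, 5573 };
const unsigned do_nothing_dll[4]   = { 379, 379, 386, 379 };
const unsigned do_nothing_local[4] = { 383, 382, 386, 382 };
```

With these inputs, cycle_ratio(crt_strlen_dll, crt_strlen_local, 4) comes out at roughly 1.23, while the same call for the two DoNothing series gives roughly 0.99, i.e. no measurable DLL penalty for the empty routine.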

TimoVJL

  • Member
  • *****
  • Posts: 1235
« Reply #13 on: May 06, 2022, 07:54:55 PM »
Code: [Select]
AMD Athlon(tm) II X2 220 Processor (SSE3)

13350   cycles for 100 * CRT strlen
11621   cycles for 100 * CRT strlen local
4554    cycles for 100 * MasmBasic Len
7781    cycles for 100 * Masm32 StrLen
14032   cycles for 100 * Masm32 len

13175   cycles for 100 * CRT strlen
12431   cycles for 100 * CRT strlen local
4564    cycles for 100 * MasmBasic Len
7804    cycles for 100 * Masm32 StrLen
19606   cycles for 100 * Masm32 len

13177   cycles for 100 * CRT strlen
11576   cycles for 100 * CRT strlen local
4543    cycles for 100 * MasmBasic Len
7807    cycles for 100 * Masm32 StrLen
19776   cycles for 100 * Masm32 len

13126   cycles for 100 * CRT strlen
11224   cycles for 100 * CRT strlen local
4552    cycles for 100 * MasmBasic Len
7781    cycles for 100 * Masm32 StrLen
14100   cycles for 100 * Masm32 len

14      bytes for CRT strlen
137     bytes for CRT strlen local
10      bytes for MasmBasic Len
10      bytes for Masm32 StrLen
10      bytes for Masm32 len

100     = eax CRT strlen
100     = eax CRT strlen local
100     = eax MasmBasic Len
100     = eax Masm32 StrLen
100     = eax Masm32 len

--- ok ---
May the source be with you

TimoVJL

« Reply #14 on: May 06, 2022, 07:57:19 PM »
Crash
Code: [Select]
not found (line 49):    InstrDLL
AMD Athlon(tm) II X2 220 Processor (SSE3)

13127   cycles for 100 * CRT strlen
10990   cycles for 100 * CRT strlen local
4523    cycles for 100 * MasmBasic Len
7781    cycles for 100 * Masm32 StrLen
14378   cycles for 100 * Masm32 len
40584   cycles for 100 * Instr (statically linked)