May I have some timings, please?
AMD Athlon Gold 3150U with Radeon Graphics (SSE4)
20314 cycles for 100 * MasmBasic Val
39045 cycles for 100 * CRT a2ud
2536 cycles for 100 * Masm32 SDK hex2bin
2686 cycles for 100 * HexStr2Bin
2092 cycles for 100 * HexStr2BinT (with table)
9181 cycles for 100 * HexStr2X (64-bit)
19732 cycles for 100 * MasmBasic Val
40132 cycles for 100 * CRT a2ud
3281 cycles for 100 * Masm32 SDK hex2bin
3952 cycles for 100 * HexStr2Bin
2231 cycles for 100 * HexStr2BinT (with table)
8839 cycles for 100 * HexStr2X (64-bit)
19825 cycles for 100 * MasmBasic Val
39673 cycles for 100 * CRT a2ud
3598 cycles for 100 * Masm32 SDK hex2bin
2747 cycles for 100 * HexStr2Bin
2054 cycles for 100 * HexStr2BinT (with table)
8678 cycles for 100 * HexStr2X (64-bit)
20010 cycles for 100 * MasmBasic Val
39973 cycles for 100 * CRT a2ud
3242 cycles for 100 * Masm32 SDK hex2bin
3565 cycles for 100 * HexStr2Bin
2150 cycles for 100 * HexStr2BinT (with table)
9123 cycles for 100 * HexStr2X (64-bit)
19712 cycles for 100 * MasmBasic Val
39055 cycles for 100 * CRT a2ud
2539 cycles for 100 * Masm32 SDK hex2bin
3479 cycles for 100 * HexStr2Bin
2385 cycles for 100 * HexStr2BinT (with table)
9158 cycles for 100 * HexStr2X (64-bit)
3 bytes for MasmBasic Val
19 bytes for CRT a2ud
12 bytes for Masm32 SDK hex2bin
48 bytes for HexStr2Bin
76 bytes for HexStr2BinT (with table)
104 bytes for HexStr2X (64-bit)
12ABCDEFh eax MasmBasic Val
12ABCDEFh eax CRT a2ud
12ABCDEFh eax Masm32 SDK hex2bin
12ABCDEFh eax HexStr2Bin
12ABCDEFh eax HexStr2BinT (with table)
56789DEFh eax HexStr2X (64-bit)
Remarks:
- The string used is 12AbCdEfh
- MasmBasic Val is slower because it's an allrounder; you can throw $123, 456h, 12345, 010010101y at it, and you'll always get the correct result. Therefore it's only twice as fast as crt_sscanf :biggrin:
- The last one, HexStr2X, uses the string 1234AbcD56789Defh; however, it returns the result in xmm0, therefore (for technical reasons) eax shows only the second half, 56789DEFh
Two things:
- Is my HexString2Bin() in there somewhere?
- Again: who really cares how fast this is? It's hard to imagine a use for this where picayune differences in speed really matter to anyone.
AMD Ryzen 9 5950X 16-Core Processor (SSE4)
15921 cycles for 100 * MasmBasic Val
26961 cycles for 100 * CRT a2ud
2126 cycles for 100 * Masm32 SDK hex2bin
2209 cycles for 100 * HexStr2Bin
1845 cycles for 100 * HexStr2BinT (with table)
6857 cycles for 100 * HexStr2X (64-bit)
16110 cycles for 100 * MasmBasic Val
26237 cycles for 100 * CRT a2ud
2205 cycles for 100 * Masm32 SDK hex2bin
2314 cycles for 100 * HexStr2Bin
1763 cycles for 100 * HexStr2BinT (with table)
6459 cycles for 100 * HexStr2X (64-bit)
15842 cycles for 100 * MasmBasic Val
26733 cycles for 100 * CRT a2ud
2039 cycles for 100 * Masm32 SDK hex2bin
2167 cycles for 100 * HexStr2Bin
1740 cycles for 100 * HexStr2BinT (with table)
6541 cycles for 100 * HexStr2X (64-bit)
15816 cycles for 100 * MasmBasic Val
26510 cycles for 100 * CRT a2ud
2085 cycles for 100 * Masm32 SDK hex2bin
2287 cycles for 100 * HexStr2Bin
1792 cycles for 100 * HexStr2BinT (with table)
6441 cycles for 100 * HexStr2X (64-bit)
15748 cycles for 100 * MasmBasic Val
26419 cycles for 100 * CRT a2ud
2087 cycles for 100 * Masm32 SDK hex2bin
2131 cycles for 100 * HexStr2Bin
1665 cycles for 100 * HexStr2BinT (with table)
6532 cycles for 100 * HexStr2X (64-bit)
3 bytes for MasmBasic Val
19 bytes for CRT a2ud
12 bytes for Masm32 SDK hex2bin
48 bytes for HexStr2Bin
76 bytes for HexStr2BinT (with table)
104 bytes for HexStr2X (64-bit)
12ABCDEFh eax MasmBasic Val
12ABCDEFh eax CRT a2ud
12ABCDEFh eax Masm32 SDK hex2bin
12ABCDEFh eax HexStr2Bin
12ABCDEFh eax HexStr2BinT (with table)
56789DEFh eax HexStr2X (64-bit)
--- ok ---
Intel(R) Core(TM) i3-10100 CPU @ 3.60GHz (SSE4)
21786 cycles for 100 * MasmBasic Val
38609 cycles for 100 * CRT a2ud
2921 cycles for 100 * Masm32 SDK hex2bin
2688 cycles for 100 * HexStr2Bin
1686 cycles for 100 * HexStr2BinT (with table)
6975 cycles for 100 * HexStr2X (64-bit)
19965 cycles for 100 * MasmBasic Val
38482 cycles for 100 * CRT a2ud
2960 cycles for 100 * Masm32 SDK hex2bin
3349 cycles for 100 * HexStr2Bin
1829 cycles for 100 * HexStr2BinT (with table)
6983 cycles for 100 * HexStr2X (64-bit)
19990 cycles for 100 * MasmBasic Val
38434 cycles for 100 * CRT a2ud
2973 cycles for 100 * Masm32 SDK hex2bin
3279 cycles for 100 * HexStr2Bin
1725 cycles for 100 * HexStr2BinT (with table)
6875 cycles for 100 * HexStr2X (64-bit)
19922 cycles for 100 * MasmBasic Val
38457 cycles for 100 * CRT a2ud
3042 cycles for 100 * Masm32 SDK hex2bin
2749 cycles for 100 * HexStr2Bin
1725 cycles for 100 * HexStr2BinT (with table)
6979 cycles for 100 * HexStr2X (64-bit)
19995 cycles for 100 * MasmBasic Val
38463 cycles for 100 * CRT a2ud
7092 cycles for 100 * Masm32 SDK hex2bin
4304 cycles for 100 * HexStr2Bin
5392 cycles for 100 * HexStr2BinT (with table)
14712 cycles for 100 * HexStr2X (64-bit)
3 bytes for MasmBasic Val
19 bytes for CRT a2ud
12 bytes for Masm32 SDK hex2bin
48 bytes for HexStr2Bin
76 bytes for HexStr2BinT (with table)
104 bytes for HexStr2X (64-bit)
12ABCDEFh eax MasmBasic Val
12ABCDEFh eax CRT a2ud
12ABCDEFh eax Masm32 SDK hex2bin
12ABCDEFh eax HexStr2Bin
12ABCDEFh eax HexStr2BinT (with table)
56789DEFh eax HexStr2X (64-bit)
--- ok ---
Quote from: NoCforMe on February 11, 2024, 07:20:27 AM: - Is my HexString2Bin() in there somewhere?
I don't think so, but check for yourself: search the source for endp.
Quote: - Again: who really cares how fast this is?
The OP?
@fearless & Héctor: Thanks :thup:
Quote from: jj2007 on February 11, 2024, 08:11:41 AM: Quote from: NoCforMe on February 11, 2024, 07:20:27 AM: - Again: who really cares how fast this is?
The OP?
Speaking of which:
Quote from: hyder on January 30, 2024, 08:09:18 AM: A while back I posted a function that converts 64-bit numeric values to a string of hexadecimal digits. After much work, I've come up with an algorithm for the reverse operation: converting hexadecimal strings to a 64-bit numeric value.
So Mr. Hyde, do you have anything to say about what's evolved here from your post?
Version 2 beats the CRT by a factor of 17:
AMD Athlon Gold 3150U with Radeon Graphics (SSE4)
19526 cycles for 100 * MasmBasic Val
39707 cycles for 100 * CRT a2ud
3668 cycles for 100 * Masm32 SDK hex2bin
3695 cycles for 100 * HexStr2Bin
2588 cycles for 100 * HexStr2BinT (with table)
2139 cycles for 100 * HexStr2X (64-bit)
20294 cycles for 100 * MasmBasic Val
39182 cycles for 100 * CRT a2ud
3594 cycles for 100 * Masm32 SDK hex2bin
3672 cycles for 100 * HexStr2Bin
2596 cycles for 100 * HexStr2BinT (with table)
2207 cycles for 100 * HexStr2X (64-bit)
19826 cycles for 100 * MasmBasic Val
39041 cycles for 100 * CRT a2ud
2552 cycles for 100 * Masm32 SDK hex2bin
3765 cycles for 100 * HexStr2Bin
2543 cycles for 100 * HexStr2BinT (with table)
2186 cycles for 100 * HexStr2X (64-bit)
20136 cycles for 100 * MasmBasic Val
39231 cycles for 100 * CRT a2ud
3732 cycles for 100 * Masm32 SDK hex2bin
3786 cycles for 100 * HexStr2Bin
2509 cycles for 100 * HexStr2BinT (with table)
2237 cycles for 100 * HexStr2X (64-bit)
20169 cycles for 100 * MasmBasic Val
39727 cycles for 100 * CRT a2ud
5039 cycles for 100 * Masm32 SDK hex2bin
2817 cycles for 100 * HexStr2Bin
2662 cycles for 100 * HexStr2BinT (with table)
2386 cycles for 100 * HexStr2X (64-bit)
3 bytes for MasmBasic Val
19 bytes for CRT a2ud
12 bytes for Masm32 SDK hex2bin
48 bytes for HexStr2Bin
76 bytes for HexStr2BinT (with table)
128 bytes for HexStr2X (64-bit)
12ABCDEFh eax MasmBasic Val
12ABCDEFh eax CRT a2ud
12ABCDEFh eax Masm32 SDK hex2bin
12ABCDEFh eax HexStr2Bin
12ABCDEFh eax HexStr2BinT (with table)
12ABCDEFh eax HexStr2X (64-bit)
For a better comparison, the same string is used for all algos. Note that the 64-bit HexStr2X is now the fastest, thanks to some SSE 4.1 acrobatics - sorry for our friends with legacy CPUs :cool:
Note also that the Masm32 SDK hex2bin has a strange little bug with odd-sized strings.
Intel(R) Core(TM) i7-10700 CPU @ 2.90GHz (SSE4)
15415 cycles for 100 * MasmBasic Val
27670 cycles for 100 * CRT a2ud
2401 cycles for 100 * Masm32 SDK hex2bin
2577 cycles for 100 * HexStr2Bin
1515 cycles for 100 * HexStr2BinT (with table)
1991 cycles for 100 * HexStr2X (64-bit)
15419 cycles for 100 * MasmBasic Val
27507 cycles for 100 * CRT a2ud
2387 cycles for 100 * Masm32 SDK hex2bin
2578 cycles for 100 * HexStr2Bin
1538 cycles for 100 * HexStr2BinT (with table)
1776 cycles for 100 * HexStr2X (64-bit)
15464 cycles for 100 * MasmBasic Val
27932 cycles for 100 * CRT a2ud
2384 cycles for 100 * Masm32 SDK hex2bin
2589 cycles for 100 * HexStr2Bin
1611 cycles for 100 * HexStr2BinT (with table)
1776 cycles for 100 * HexStr2X (64-bit)
15432 cycles for 100 * MasmBasic Val
27581 cycles for 100 * CRT a2ud
2393 cycles for 100 * Masm32 SDK hex2bin
2529 cycles for 100 * HexStr2Bin
1555 cycles for 100 * HexStr2BinT (with table)
1728 cycles for 100 * HexStr2X (64-bit)
15411 cycles for 100 * MasmBasic Val
27860 cycles for 100 * CRT a2ud
2441 cycles for 100 * Masm32 SDK hex2bin
2613 cycles for 100 * HexStr2Bin
1654 cycles for 100 * HexStr2BinT (with table)
1735 cycles for 100 * HexStr2X (64-bit)
3 bytes for MasmBasic Val
19 bytes for CRT a2ud
12 bytes for Masm32 SDK hex2bin
48 bytes for HexStr2Bin
76 bytes for HexStr2BinT (with table)
128 bytes for HexStr2X (64-bit)
12ABCDEFh eax MasmBasic Val
12ABCDEFh eax CRT a2ud
12ABCDEFh eax Masm32 SDK hex2bin
12ABCDEFh eax HexStr2Bin
12ABCDEFh eax HexStr2BinT (with table)
12ABCDEFh eax HexStr2X (64-bit)
--- ok ---
The first version ran OK; this second one tripped Windows Defender.
Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz (SSE4)
26332 cycles for 100 * MasmBasic Val
49023 cycles for 100 * CRT a2ud
3193 cycles for 100 * Masm32 SDK hex2bin
6746 cycles for 100 * HexStr2Bin
3455 cycles for 100 * HexStr2BinT (with table)
10090 cycles for 100 * HexStr2X (64-bit)
25977 cycles for 100 * MasmBasic Val
42327 cycles for 100 * CRT a2ud
3540 cycles for 100 * Masm32 SDK hex2bin
3117 cycles for 100 * HexStr2Bin
1996 cycles for 100 * HexStr2BinT (with table)
7329 cycles for 100 * HexStr2X (64-bit)
29533 cycles for 100 * MasmBasic Val
43779 cycles for 100 * CRT a2ud
4857 cycles for 100 * Masm32 SDK hex2bin
5937 cycles for 100 * HexStr2Bin
3429 cycles for 100 * HexStr2BinT (with table)
7939 cycles for 100 * HexStr2X (64-bit)
41624 cycles for 100 * MasmBasic Val
50731 cycles for 100 * CRT a2ud
3926 cycles for 100 * Masm32 SDK hex2bin
6351 cycles for 100 * HexStr2Bin
2003 cycles for 100 * HexStr2BinT (with table)
7744 cycles for 100 * HexStr2X (64-bit)
36712 cycles for 100 * MasmBasic Val
42581 cycles for 100 * CRT a2ud
3201 cycles for 100 * Masm32 SDK hex2bin
6077 cycles for 100 * HexStr2Bin
2002 cycles for 100 * HexStr2BinT (with table)
7664 cycles for 100 * HexStr2X (64-bit)
3 bytes for MasmBasic Val
19 bytes for CRT a2ud
12 bytes for Masm32 SDK hex2bin
48 bytes for HexStr2Bin
76 bytes for HexStr2BinT (with table)
104 bytes for HexStr2X (64-bit)
12ABCDEFh eax MasmBasic Val
12ABCDEFh eax CRT a2ud
12ABCDEFh eax Masm32 SDK hex2bin
12ABCDEFh eax HexStr2Bin
12ABCDEFh eax HexStr2BinT (with table)
56789DEFh eax HexStr2X (64-bit)
Also working on an SSE2 packed conversion while commuting on the train; now that I'm at home, I need to run it through the debugger and fix it.
Quote from: sinsi on February 11, 2024, 12:50:53 PM: this second one tripped Windows Defender
The OS must be defended against exotic modern stuff like psllq, pextrb and pinsrb :thumbsup:
Quote from: daydreamer on February 11, 2024, 06:20:52 PM: working on an SSE2 packed conversion
What kind of conversion?
Quote from: NoCforMe on February 11, 2024, 08:52:15 AM: So Mr. Hyde, do you have anything to say about what's evolved here from your post?
FWIW, I'm currently working on 32-bit ARM code for Volume 2 of "The Art of ARM Assembly" and I will occasionally post an x86 conversion here to see if I can get some ideas for improving the ARM code. Most of the crazy SSE/AVX stuff won't translate well, but the generic x86 code is useful to look at. I would have loved to find an SSE/AVX algorithm that processes multiple characters at a time, but nothing like that appears here (that I could see, anyway).
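For what it's worth, handling several characters at a time does not strictly require SSE/AVX. The following is a minimal sketch in portable C, not code posted in this thread: it uses a SWAR trick (plain 64-bit integer arithmetic) to convert eight hex digits in one go, which may translate more readily to general-purpose registers on other CPUs. The names hex8_swar and hex16_swar are made up for illustration. The sketch assumes a little-endian CPU and input consisting only of valid hex digits; there is no validation, no underscore handling and no trailing h.

```c
#include <stdint.h>
#include <string.h>

/* Convert exactly 8 hex characters to a 32-bit value, all 8 at once.
   Assumptions: little-endian CPU, valid hex digits only (nothing is checked). */
uint32_t hex8_swar(const char *s)
{
    uint64_t v;
    memcpy(&v, s, 8);   /* the first character lands in the lowest byte */

    /* Bit 6 is set in 'A'-'F' and 'a'-'f' but clear in '0'-'9' */
    uint64_t letter = (v >> 6) & 0x0101010101010101ull;

    /* Low nibble, plus 9 for letters: every byte now holds 0..15 */
    uint64_t nib = (v & 0x0F0F0F0F0F0F0F0Full) + letter * 9;

    /* Merge neighbouring nibbles: each 16-bit lane holds one finished byte */
    uint64_t t = ((nib << 4) | (nib >> 8)) & 0x00FF00FF00FF00FFull;

    /* Put the four bytes into big-endian (text) order */
    return (uint32_t)(((t & 0xFF) << 24) | (((t >> 16) & 0xFF) << 16)
                    | (((t >> 32) & 0xFF) << 8) | ((t >> 48) & 0xFF));
}

/* Convert exactly 16 hex characters to a 64-bit value (same assumptions). */
uint64_t hex16_swar(const char *s)
{
    return ((uint64_t)hex8_swar(s) << 32) | hex8_swar(s + 8);
}
```

With the strings used in the timings above, hex8_swar("12AbCdEf") yields 12ABCDEFh and hex16_swar("1234AbcD56789Def") yields 1234ABCD56789DEFh. A real library routine would still need a scalar path for short strings and for input validation.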
There are considerable apples-to-oranges comparisons going on here. For example, none of the routines I've seen handle underscores in the input, so comparing my function against those is not a good comparison; likewise, MasmBasic Val does so much more that it is also an unfair comparison. Running the tests on a single input string is dangerous, to say the least. That's why I used a large number of strings as inputs to my function, chosen so that they tended to hit some boundary conditions. Of course, in the real world, most input strings are going to be relatively short (probably four digits or less), so choosing a large number of longer strings (or a long string as your only input) can be misleading for certain algorithms.
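To make the point about underscores and boundary conditions concrete, here is a minimal C sketch. It is not the function discussed in this thread, whose exact rules are not shown here: hex_to_u64 and hex_self_test are hypothetical names, underscores are simply skipped wherever they appear, a single trailing h is tolerated, and anything that does not fit in 64 bits is rejected.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: parse hex digits with optional '_' separators and an
   optional trailing 'h'. Returns 1 on success; 0 on empty input, an illegal
   character, or a value that does not fit in 64 bits. */
int hex_to_u64(const char *s, uint64_t *out)
{
    uint64_t v = 0;
    int digits = 0;
    for (; *s; s++) {
        char c = *s;
        int d;
        if (c == '_') continue;                            /* separator: skip */
        if ((c == 'h' || c == 'H') && s[1] == '\0') break; /* trailing h */
        if (c >= '0' && c <= '9')      d = c - '0';
        else if (c >= 'a' && c <= 'f') d = c - 'a' + 10;
        else if (c >= 'A' && c <= 'F') d = c - 'A' + 10;
        else return 0;                                     /* illegal character */
        if (v >> 60) return 0;       /* top nibble in use: next shift overflows */
        v = (v << 4) | (uint64_t)d;
        digits++;
    }
    if (digits == 0) return 0;
    *out = v;
    return 1;
}

/* A few boundary inputs of the kind a test or benchmark should include.
   Returns the number of cases that did not behave as expected. */
int hex_self_test(void)
{
    static const struct { const char *text; int ok; uint64_t value; } cases[] = {
        { "0",                     1, 0 },
        { "12AbCdEfh",             1, 0x12ABCDEFull },
        { "1234_ABCD_5678_9DEF",   1, 0x1234ABCD56789DEFull },
        { "FFFFFFFFFFFFFFFF",      1, 0xFFFFFFFFFFFFFFFFull },
        { "1_0000_0000_0000_0000", 0, 0 },   /* 17 significant digits */
        { "",                      0, 0 },   /* empty string */
        { "12G4",                  0, 0 },   /* illegal digit */
    };
    int failures = 0;
    for (size_t i = 0; i < sizeof cases / sizeof cases[0]; i++) {
        uint64_t v = 0;
        int ok = hex_to_u64(cases[i].text, &v);
        if (ok != cases[i].ok || (ok && v != cases[i].value)) failures++;
    }
    return failures;
}
```

A timing run over such a table, rather than over one fixed string, would also show how an algorithm behaves on short inputs and on rejected inputs.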
Also, I rarely drop down into optimizations involving instruction scheduling or code alignment. Such code executes well on *one* CPU, not as well on other CPUs (in the same CPU family). Modern compilers (with command-line switches) do a much better job of this kind of optimization these days. I'm not saying that a human couldn't beat a compiler if they really tried, I'm just saying that a human probably wouldn't redo the code for every CPU possibility, whereas the C programmer can just change a command-line option and get better code for a different CPU. FWIW, I back-ported my ARM code to (very unstructured) C code and the compiler generated code almost identical to my hand-written code. I was then able to generate Cortex-A53 (Pi 3), -A72 (Pi 4), -A76 (Pi 5), Cortex-M7F (Teensy 4.1), Cortex-M4 (Teensy 3.2), and Cortex-M0+ (Pico) code just by changing a command-line option. Except for the Cortex-M0+ (a brain-dead instruction set), the resulting code was quite good (this was all 32-bit code, btw).
And to answer the question about "why would anyone care about the speed?"
If you're writing library code for others to call, it should be optimized for space or speed (depending on the user's requirements). I generally choose speed. Of course, for generic library code, you cannot get away with some of the algorithms posted here as they wouldn't mesh well with calling code. As I am writing my code for use in "The Art of ARM Assembly Volume 2" (32-bit code), I like to preserve all the registers, which makes the code much easier to use by assembly language programmers (especially beginners, who tend to be the ones reading my books) even if it costs a little performance. For example, I have a "print" function I call, which is a front end for the C printf function, that preserves all the registers that printf() might wipe out (a large number, considering the SSE/AVX set). Not that making printf() any faster would be noticeable (as it is *really* slow to begin with), but not having to preserve any registers around the call to print (other than possible parameters you are passing in registers) is a big win, even with the performance loss.
Cheers,
Randy Hyde
Quote from: hyder on February 14, 2024, 08:19:47 AM: for generic library code, you cannot get away with some of the algorithms posted here as they wouldn't mesh well with calling code
Hi Randy,
Can you elaborate a bit on that one? The algos posted here are normally compatible with the Windows ABI. Some of mine may not run on very old CPUs, but they run perfectly on 99% of all machines... so I don't quite understand what you mean :cool:
Version 3: CRT sscanf is out, slightly faster strtoull is in:
AMD Athlon Gold 3150U with Radeon Graphics (SSE4)
19228 cycles for 100 * MasmBasic Val
2638 cycles for 100 * Masm32 SDK hex2bin
25866 cycles for 100 * strtoull
1647 cycles for 100 * HexVal
19188 cycles for 100 * MasmBasic Val
2627 cycles for 100 * Masm32 SDK hex2bin
25486 cycles for 100 * strtoull
1713 cycles for 100 * HexVal
19315 cycles for 100 * MasmBasic Val
2649 cycles for 100 * Masm32 SDK hex2bin
25776 cycles for 100 * strtoull
1651 cycles for 100 * HexVal
19143 cycles for 100 * MasmBasic Val
2678 cycles for 100 * Masm32 SDK hex2bin
25735 cycles for 100 * strtoull
1665 cycles for 100 * HexVal
19124 cycles for 100 * MasmBasic Val
2641 cycles for 100 * Masm32 SDK hex2bin
25542 cycles for 100 * strtoull
1840 cycles for 100 * HexVal
strtoull sits in ucrtbase.dll, which might not be available on older Windows versions. The program checks for its presence, though.
The OS msvcrt.dll exports _strtoui64 (https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/strtoui64-wcstoui64-strtoui64-l-wcstoui64-l?view=msvc-170).
AMD Athlon(tm) II X2 220 Processor (SSE3)
42190 cycles for 100 * MasmBasic Val
4935 cycles for 100 * Masm32 SDK hex2bin
41442 cycles for 100 * strtoull
Quote from: TimoVJL on February 14, 2024, 02:31:59 PM: AMD Athlon(tm) II X2 220 Processor (SSE3)
42190 cycles for 100 * MasmBasic Val
41442 cycles for 100 * strtoull
Congrats, you beat MasmBasic :thup:
An old AMD is still in use; I got it from my niece when I bought her an HP Intel i5 laptop.
I just let the AMD Ryzen rest upstairs with its 32 GB of memory: your AMD Athlon Gold 3150U has the same kind of CPU, so it wouldn't help the tests.
Quote from: jj2007 on February 14, 2024, 08:29:49 PM: Congrats, you beat MasmBasic :thup:
Apparently it's not so hard :biggrin:, but not always :thumbsup:
Intel(R) Core(TM) i3-10100 CPU @ 3.60GHz (SSE4)
20432 cycles for 100 * MasmBasic Val
3208 cycles for 100 * Masm32 SDK hex2bin
20506 cycles for 100 * strtoull
2656 cycles for 100 * HexVal
20648 cycles for 100 * MasmBasic Val
3658 cycles for 100 * Masm32 SDK hex2bin
20626 cycles for 100 * strtoull
2764 cycles for 100 * HexVal
20714 cycles for 100 * MasmBasic Val
3649 cycles for 100 * Masm32 SDK hex2bin
20098 cycles for 100 * strtoull
2762 cycles for 100 * HexVal
20720 cycles for 100 * MasmBasic Val
3628 cycles for 100 * Masm32 SDK hex2bin
20784 cycles for 100 * strtoull
2771 cycles for 100 * HexVal
20596 cycles for 100 * MasmBasic Val
3613 cycles for 100 * Masm32 SDK hex2bin
20639 cycles for 100 * strtoull
2734 cycles for 100 * HexVal
Quote from: HSE on February 14, 2024, 10:14:56 PM: Apparently it's not so hard :biggrin:, but not always :thumbsup:
One should not compare apples and oranges: you can throw 0x123, $123, 123h, 1234 (ordinary decimal number) or 101010101010b at MasmBasic Val() (https://www.jj2007.eu/MasmBasicQuickReference.htm#Mb1202), and always get a correct answer - plus, for fast parsing, the number of used characters in edx *).
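To illustrate why such an all-rounder has more work to do than a pure hex converter, here is a rough C sketch of the format detection involved. It is not how MasmBasic Val is implemented: parse_any and its helpers are hypothetical names, only the notations listed above (plus the y suffix for binary) are recognized, and overflow is not checked. Like Val, it reports the number of characters used.

```c
#include <stddef.h>
#include <stdint.h>

/* Value of one digit character, or -1 if it is not a hex digit. */
static int digit_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Accumulate n digits of s in the given base.
   Returns 0 if n is 0 or a digit does not belong to the base. */
static int accumulate(const char *s, size_t n, int base, uint64_t *out)
{
    uint64_t v = 0;
    if (n == 0) return 0;
    for (size_t i = 0; i < n; i++) {
        int d = digit_value(s[i]);
        if (d < 0 || d >= base) return 0;
        v = v * (uint64_t)base + (uint64_t)d;  /* overflow not checked in this sketch */
    }
    *out = v;
    return 1;
}

/* Hypothetical all-rounder: accepts 0x123, $123, 123h, 1234, 1010b and 1010y.
   Returns the number of characters used, or 0 if nothing could be parsed. */
size_t parse_any(const char *s, uint64_t *out)
{
    size_t skip = 0, n = 0;
    if (s[0] == '$') skip = 1;
    else if (s[0] == '0' && (s[1] == 'x' || s[1] == 'X')) skip = 2;

    const char *p = s + skip;
    while (digit_value(p[n]) >= 0) n++;     /* longest run of hex-digit characters */

    if (skip)                               /* $ or 0x prefix: hexadecimal */
        return accumulate(p, n, 16, out) ? skip + n : 0;
    if (p[n] == 'h' || p[n] == 'H')         /* h suffix: hexadecimal */
        return accumulate(p, n, 16, out) ? n + 1 : 0;
    if (p[n] == 'y' || p[n] == 'Y')         /* y suffix: binary */
        return accumulate(p, n, 2, out) ? n + 1 : 0;
    if (n > 1 && (p[n - 1] == 'b' || p[n - 1] == 'B'))  /* b suffix is itself a hex digit */
        return accumulate(p, n - 1, 2, out) ? n : 0;
    return accumulate(p, n, 10, out) ? n : 0;           /* plain decimal */
}
```

The suffix notations are the expensive part: the base is only known after the whole digit run has been scanned, so the string is effectively read twice.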
The proper comparison is between strtoull and the new HexVal macro, which both handle hexadecimal Ascii strings; and there I see a factor of 7.x ;-)
Besides, while the strtoull docs say "long long" (IBM (https://www.ibm.com/docs/en/zos/2.4.0?topic=programs-strtoull-convert-string-unsigned-long-long): "strtoull() returns the converted unsigned long long value, represented in the string"), they don't say that you have to grab the long long from eax and edx:
include \masm32\MasmBasic\MasmBasic.inc
Init
Cls 3
MbHexQ=1 ; limit Hex$ to 16 bytes
Dll "ucrtbase"
Declare strtoull, C:3
Let esi="1234567890AbCdEfh"
PrintLine "strtoull returns ", Hex$(strtoull(esi, 0, 16))
PrintLine "HexVal returns ", Hex$(HexVal(esi, 64))
MbHexQ=0 ; no limit
Inkey "HexVal returns ", Hex$(HexVal("1234567890AbCdEf1234567890AbCdEfh", 128))
EndOfCode
strtoull returns 90ABCDEF
HexVal returns 12345678 90ABCDEF
HexVal returns 12345678 90ABCDEF 12345678 90ABCDEF
*) RichMasm uses Val(): Paste 0x123, $123, 123h, 1234, 101010101010y, then select each number and hit Ctrl N.
Hi JJ,
Quote from: jj2007 on February 14, 2024, 11:03:11 PM: One should not compare apples and oranges
If MasmBasic Val() is not comparable, just don't compare :biggrin:
Apparently strtoull is less friendly than MasmBasic Val(), but it can process more numeric bases. I didn't even know about that function. Good investigation :thumbsup:
Quote from: HSE on February 14, 2024, 11:39:44 PM: can process more numeric bases
If I need to convert octal numbers, I'll use strtoull then, thanks.
Just to note that strtoull returns its result in edx:eax
strtoull returns in edx 12345678
strtoull returns in eax 90ABCDEF
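Seen from C, the two registers are simply the two halves of the 64-bit return value. A small sketch, assuming nothing beyond the standard strtoull; the Halves struct and the hex_halves helper are made up for illustration. It also reports how many characters were consumed, which strtoull makes available through its end pointer.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper type: both 32-bit halves of the result plus the
   number of characters strtoull consumed. */
typedef struct {
    uint32_t hi;    /* what 32-bit x86 code finds in edx */
    uint32_t lo;    /* what 32-bit x86 code finds in eax */
    size_t   used;  /* characters consumed before parsing stopped */
} Halves;

Halves hex_halves(const char *s)
{
    char *end;
    unsigned long long v = strtoull(s, &end, 16);  /* stops at the first non-hex character */
    Halves h;
    h.hi = (uint32_t)(v >> 32);
    h.lo = (uint32_t)v;
    h.used = (size_t)(end - s);
    return h;
}
```

For "1234567890AbCdEfh" this gives hi = 12345678h and lo = 90ABCDEFh with 16 characters used, since parsing stops at the trailing h. Code that looks only at eax sees just the low half.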
Quote from: jj2007 on February 14, 2024, 11:03:11 PM: while the strtoull docs say "long long" ..., they don't say that you have to grab the long long from eax and edx
HexVal(pStr, 64) returns a QWORD in xmm0, which is easier to handle than two volatile registers.
For the best print performance, put the code in a worker thread: it has a whole separate set of registers, and the main thread prints the results.
But I have no knowledge of how ARM SIMT works.
So is the easiest way to port to ARM assembler to stick to scalar GP registers, both 32- and 64-bit?
Hi Daydreamer, you are obviously in the wrong thread...
Quote: It really depends on the calling convention used, but typically EAX is used for 32-bit and smaller integral data types, floating point values tend to use FPU or MMX registers, and 64-bit integral types tend to use a combination of EAX and EDX instead. Then there is the issue of complex class/struct types, in which case the compiler may decide to optimize away the return value and use an extra output parameter on the call stack to pass the returned object by reference to the caller.
Does the return value always go into eax register after a method call? (https://stackoverflow.com/questions/21195910/does-the-return-value-always-go-into-eax-register-after-a-method-call)
Quote from: TimoVJL on February 15, 2024, 09:09:02 PM: 64-bit integral types tend to use a combination of EAX and EDX
Quote from: jj2007 on February 15, 2024, 12:46:39 AM: HexVal(pStr, 64) returns a QWORD in xmm0, which is easier to handle than two volatile registers.
HexVal("1234ABCDh") returns the value in eax, which covers probably 99% of all use cases.
When I need a 64-bit value, however, xmm0 is a far better choice, at least in a library that has no problem with Print Str$(xmm0) :cool:
It's not my fault that the C/C++ family has problems with accepting xmm regs as input...
Quote from: jj2007 on February 15, 2024, 09:53:52 PM: It's not my fault that the C/C++ family has problems with accepting xmm regs as input...
There are so-called standards that keep some features out of the question?
The i386 was also tricky for standards, as CPUs don't dictate the standards of common programming languages.
Quote from: TimoVJL on February 16, 2024, 08:58:32 PM: There are so-called standards that keep some features out of the question?
You want us all to go back to the dark pre-SIMD ages?
Quote from: jj2007 on February 16, 2024, 09:25:19 PM: Quote from: TimoVJL on February 16, 2024, 08:58:32 PM: There are so-called standards that keep some features out of the question?
You want us all to go back to the dark pre-SIMD ages?
No, x64 is already for us now :biggrin:
Quote from: jj2007 on February 16, 2024, 09:25:19 PM: Quote from: TimoVJL on February 16, 2024, 08:58:32 PM: There are so-called standards that keep some features out of the question?
You want us all to go back to the dark pre-SIMD ages?
Don't forget that those dark ages of no SIMD, slow CPUs and tiny RAM are what started this fun of shaving clock cycles in asm; without them, we would find it pointless to make code faster on 3+ GHz CPUs and very powerful GPUs.
Some members still enjoy the bigger challenge that 16-bit real-mode DOS coding is.
After coding one non-SIMD converter and one SIMD version, I saw the pros and cons, and I am now trying to code a hybrid version.
Quote from: daydreamer on February 22, 2024, 02:20:58 AM: After coding one non-SIMD converter and one SIMD version, I saw the pros and cons, and I am now trying to code a hybrid version
Show me, I'm curious :biggrin:
Quote from: jj2007 on February 14, 2024, 12:45:31 PM: Version 3: CRT sscanf is out, slightly faster strtoull is in:
13th Gen Intel(R) Core(TM) i9-13980HX (SSE4)
11628 cycles for 100 * MasmBasic Val
1082 cycles for 100 * Masm32 SDK hex2bin
6352 cycles for 100 * strtoull
847 cycles for 100 * HexVal
11620 cycles for 100 * MasmBasic Val
1079 cycles for 100 * Masm32 SDK hex2bin
6427 cycles for 100 * strtoull
858 cycles for 100 * HexVal
11619 cycles for 100 * MasmBasic Val
1070 cycles for 100 * Masm32 SDK hex2bin
6360 cycles for 100 * strtoull
844 cycles for 100 * HexVal
11600 cycles for 100 * MasmBasic Val
1109 cycles for 100 * Masm32 SDK hex2bin
6444 cycles for 100 * strtoull
857 cycles for 100 * HexVal
11669 cycles for 100 * MasmBasic Val
1090 cycles for 100 * Masm32 SDK hex2bin
6359 cycles for 100 * strtoull
858 cycles for 100 * HexVal
3 bytes for MasmBasic Val
12 bytes for Masm32 SDK hex2bin
16 bytes for strtoull
0 bytes for HexVal
12ABCDEFh eax MasmBasic Val
12ABCDEFh eax CRT sscanf
12ABCDEFh eax Masm32 SDK hex2bin
12ABCDEFh eax strtoull
12ABCDEFh eax HexVal
--- ok ---
Quote from: LiaoMi on February 22, 2024, 06:15:58 AM: 11619 cycles for 100 * MasmBasic Val
1070 cycles for 100 * Masm32 SDK hex2bin
6360 cycles for 100 * strtoull
844 cycles for 100 * HexVal
That's a pretty fast cpu :thumbsup:
Keep testing.
This test doesn't work well on variable-speed CPUs, and C compiler optimizations cause trouble.
Comparing msvcrt.dll _strtoui64 with ucrtbase.dll strtoull is interesting:
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

// strtoull and _strtoui64 are both __cdecl and return 64 bits;
// PROC (stdcall, returns INT_PTR) does not match that in 32-bit code
typedef unsigned long long (__cdecl *STRTOULL)(const char *, char **, int);

// Time 100 calls through pfn and print the elapsed performance-counter ticks
static void TimeIt(STRTOULL pfn)
{
    volatile unsigned long long ll;   // volatile keeps the optimizer from dropping the calls
    LARGE_INTEGER t1, t2;
    if (!pfn) { printf("function not found\n"); return; }
    QueryPerformanceCounter(&t1);
    for (int i = 0; i < 100; i++)
        ll = pfn("12ABCDEFh", NULL, 16);
    QueryPerformanceCounter(&t2);
    printf("%lld\n", t2.QuadPart - t1.QuadPart);
}

// Load the DLL, look up the named export, time it, and unload again
static void TimeDll(const char *dll, const char *name)
{
    HMODULE hDll = LoadLibraryA(dll);
    if (!hDll) { printf("%s not found\n", dll); return; }
    TimeIt((STRTOULL)GetProcAddress(hDll, name));
    FreeLibrary(hDll);
}

int __cdecl main(void)
{
    TimeIt(strtoull);                      // CRT the program is linked against
    TimeDll("msvcrt.dll", "_strtoui64");   // OS CRT
    TimeDll("ucrtbase.dll", "strtoull");   // Universal CRT
    return 0;
}
Quote from: jj2007 on February 14, 2024, 11:20:44 AM: Quote from: hyder on February 14, 2024, 08:19:47 AM: for generic library code, you cannot get away with some of the algorithms posted here as they wouldn't mesh well with calling code
Hi Randy,
Can you elaborate a bit on that one? The algos posted here are normally compatible with the Windows ABI. Some of mine may not run on very old CPUs, but they run perfectly on 99% of all machines... so I don't quite understand what you mean :cool:
I am referring to my particular code calling these routines.
For example, as an assembly programmer, I always preserve all registers I modify that don't explicitly return values. This is slower than the Intel ABI, but safer for use in assembly programs.
Cheers,
Randy Hyde