Hi guys,
Can someone test this Bin2Hex variant I adapted from an old one?
I rebuilt two of them for testing:
dwtoHex_Guga_SSE
FastHex (similar to the one above, but without SSE; I didn't change the original name, I just put my code into it)
Both use a huge hexadecimal table in my tests.
Note: despite the speed, the downside is that the function is huge because of the table.
Intel(R) Core(TM)2 Duo CPU E8400 @ 3.00GHz (SSE4)
4149 cycles for 100 * dw2hex
4401 cycles for 100 * Hex$ Grincheux
61610 cycles for 100 * CRT sprintf
685 cycles for 100 * Bin2Hex
785 cycles for 100 * Bin2Hex2 cx
1286 cycles for 100 * dwtoHex_Guga_SSE
1434 cycles for 100 * dwtoHex_Guga
4070 cycles for 100 * dw2hex
4397 cycles for 100 * Hex$ Grincheux
61441 cycles for 100 * CRT sprintf
698 cycles for 100 * Bin2Hex
789 cycles for 100 * Bin2Hex2 cx
1285 cycles for 100 * dwtoHex_Guga_SSE
1436 cycles for 100 * dwtoHex_Guga
4073 cycles for 100 * dw2hex
4391 cycles for 100 * Hex$ Grincheux
61415 cycles for 100 * CRT sprintf
681 cycles for 100 * Bin2Hex
786 cycles for 100 * Bin2Hex2 cx
1287 cycles for 100 * dwtoHex_Guga_SSE
1434 cycles for 100 * dwtoHex_Guga
4070 cycles for 100 * dw2hex
4392 cycles for 100 * Hex$ Grincheux
61459 cycles for 100 * CRT sprintf
683 cycles for 100 * Bin2Hex
785 cycles for 100 * Bin2Hex2 cx
1289 cycles for 100 * dwtoHex_Guga_SSE
1437 cycles for 100 * dwtoHex_Guga
20 bytes for dw2hex
64 bytes for Hex$ Grincheux
29 bytes for CRT sprintf
138 bytes for Bin2Hex
150 bytes for Bin2Hex2 cx
245848 bytes for dwtoHex_Guga_SSE
76 bytes for dwtoHex_Guga
00345678 = eax dw2hex
12345678 = eax Hex$ Grincheux
345678 = eax CRT sprintf
12345678 = eax Bin2Hex
00345678 = eax Bin2Hex2 cx
12345678 = eax dwtoHex_Guga_SSE
12345678 = eax dwtoHex_Guga
--- ok ---
second run
Intel(R) Core(TM)2 Duo CPU E8400 @ 3.00GHz (SSE4)
4076 cycles for 100 * dw2hex
4403 cycles for 100 * Hex$ Grincheux
61353 cycles for 100 * CRT sprintf
691 cycles for 100 * Bin2Hex
795 cycles for 100 * Bin2Hex2 cx
1301 cycles for 100 * dwtoHex_Guga_SSE
1444 cycles for 100 * dwtoHex_Guga
4080 cycles for 100 * dw2hex
4401 cycles for 100 * Hex$ Grincheux
61223 cycles for 100 * CRT sprintf
708 cycles for 100 * Bin2Hex
800 cycles for 100 * Bin2Hex2 cx
1295 cycles for 100 * dwtoHex_Guga_SSE
1444 cycles for 100 * dwtoHex_Guga
4309 cycles for 100 * dw2hex
4419 cycles for 100 * Hex$ Grincheux
61311 cycles for 100 * CRT sprintf
707 cycles for 100 * Bin2Hex
812 cycles for 100 * Bin2Hex2 cx
1316 cycles for 100 * dwtoHex_Guga_SSE
1460 cycles for 100 * dwtoHex_Guga
4092 cycles for 100 * dw2hex
4413 cycles for 100 * Hex$ Grincheux
61683 cycles for 100 * CRT sprintf
708 cycles for 100 * Bin2Hex
811 cycles for 100 * Bin2Hex2 cx
1315 cycles for 100 * dwtoHex_Guga_SSE
1444 cycles for 100 * dwtoHex_Guga
20 bytes for dw2hex
64 bytes for Hex$ Grincheux
29 bytes for CRT sprintf
138 bytes for Bin2Hex
150 bytes for Bin2Hex2 cx
245848 bytes for dwtoHex_Guga_SSE
76 bytes for dwtoHex_Guga
00345678 = eax dw2hex
12345678 = eax Hex$ Grincheux
345678 = eax CRT sprintf
12345678 = eax Bin2Hex
00345678 = eax Bin2Hex2 cx
12345678 = eax dwtoHex_Guga_SSE
12345678 = eax dwtoHex_Guga
--- ok ---
guga,
here are my results:
2787 cycles for 100 * dw2hex
2400 cycles for 100 * Hex$ Grincheux
41084 cycles for 100 * CRT sprintf
505 cycles for 100 * Bin2Hex
534 cycles for 100 * Bin2Hex2 cx
616 cycles for 100 * dwtoHex_Guga_SSE
658 cycles for 100 * dwtoHex_Guga
2807 cycles for 100 * dw2hex
2454 cycles for 100 * Hex$ Grincheux
41195 cycles for 100 * CRT sprintf
501 cycles for 100 * Bin2Hex
502 cycles for 100 * Bin2Hex2 cx
612 cycles for 100 * dwtoHex_Guga_SSE
659 cycles for 100 * dwtoHex_Guga
2886 cycles for 100 * dw2hex
2294 cycles for 100 * Hex$ Grincheux
40487 cycles for 100 * CRT sprintf
506 cycles for 100 * Bin2Hex
505 cycles for 100 * Bin2Hex2 cx
666 cycles for 100 * dwtoHex_Guga_SSE
694 cycles for 100 * dwtoHex_Guga
2751 cycles for 100 * dw2hex
2453 cycles for 100 * Hex$ Grincheux
40054 cycles for 100 * CRT sprintf
506 cycles for 100 * Bin2Hex
522 cycles for 100 * Bin2Hex2 cx
625 cycles for 100 * dwtoHex_Guga_SSE
662 cycles for 100 * dwtoHex_Guga
20 bytes for dw2hex
64 bytes for Hex$ Grincheux
29 bytes for CRT sprintf
138 bytes for Bin2Hex
150 bytes for Bin2Hex2 cx
245848 bytes for dwtoHex_Guga_SSE
76 bytes for dwtoHex_Guga
00345678 = eax dw2hex
12345678 = eax Hex$ Grincheux
345678 = eax CRT sprintf
12345678 = eax Bin2Hex
00345678 = eax Bin2Hex2 cx
12345678 = eax dwtoHex_Guga_SSE
12345678 = eax dwtoHex_Guga
--- ok ---
I hope this has been helpful to you.
Thanks guys
I'm trying to see whether the algorithm is worth keeping with a huge table, but it seems that on Intel it loses speed somehow? The original version is closer to the results from Bin2Hex, but AMD seems to handle such huge tables better. Maybe huge tables like that are not worthwhile, because in the end we are talking about a speed difference of less than 30%-40%, which also seems to depend heavily on the processor?
Your results are a bit surprising to me. I thought Intel would deal better with such situations :dazzled: .
AMD Ryzen 5 2400G with Radeon Vega Graphics (SSE4)
7155 cycles for 100 * dw2hex
2999 cycles for 100 * Hex$ Grincheux
46349 cycles for 100 * CRT sprintf
1017 cycles for 100 * Bin2Hex
854 cycles for 100 * Bin2Hex2 cx
655 cycles for 100 * dwtoHex_Guga_SSE
761 cycles for 100 * dwtoHex_Guga
7213 cycles for 100 * dw2hex
2883 cycles for 100 * Hex$ Grincheux
45477 cycles for 100 * CRT sprintf
863 cycles for 100 * Bin2Hex
893 cycles for 100 * Bin2Hex2 cx
655 cycles for 100 * dwtoHex_Guga_SSE
732 cycles for 100 * dwtoHex_Guga
7224 cycles for 100 * dw2hex
2890 cycles for 100 * Hex$ Grincheux
44463 cycles for 100 * CRT sprintf
866 cycles for 100 * Bin2Hex
815 cycles for 100 * Bin2Hex2 cx
693 cycles for 100 * dwtoHex_Guga_SSE
732 cycles for 100 * dwtoHex_Guga
7516 cycles for 100 * dw2hex
3078 cycles for 100 * Hex$ Grincheux
46002 cycles for 100 * CRT sprintf
929 cycles for 100 * Bin2Hex
880 cycles for 100 * Bin2Hex2 cx
661 cycles for 100 * dwtoHex_Guga_SSE
804 cycles for 100 * dwtoHex_Guga
20 bytes for dw2hex
64 bytes for Hex$ Grincheux
29 bytes for CRT sprintf
138 bytes for Bin2Hex
150 bytes for Bin2Hex2 cx
245848 bytes for dwtoHex_Guga_SSE
76 bytes for dwtoHex_Guga
00345678 = eax dw2hex
12345678 = eax Hex$ Grincheux
345678 = eax CRT sprintf
12345678 = eax Bin2Hex
00345678 = eax Bin2Hex2 cx
12345678 = eax dwtoHex_Guga_SSE
12345678 = eax dwtoHex_Guga
--- ok ---
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
3333 cycles for 100 * dw2hex
3600 cycles for 100 * Hex$ Grincheux
42239 cycles for 100 * CRT sprintf
540 cycles for 100 * Bin2Hex
629 cycles for 100 * Bin2Hex2 cx
736 cycles for 100 * dwtoHex_Guga_SSE
837 cycles for 100 * dwtoHex_Guga
3331 cycles for 100 * dw2hex
3600 cycles for 100 * Hex$ Grincheux
42094 cycles for 100 * CRT sprintf
565 cycles for 100 * Bin2Hex
621 cycles for 100 * Bin2Hex2 cx
732 cycles for 100 * dwtoHex_Guga_SSE
815 cycles for 100 * dwtoHex_Guga
3318 cycles for 100 * dw2hex
3602 cycles for 100 * Hex$ Grincheux
42300 cycles for 100 * CRT sprintf
555 cycles for 100 * Bin2Hex
621 cycles for 100 * Bin2Hex2 cx
829 cycles for 100 * dwtoHex_Guga_SSE
843 cycles for 100 * dwtoHex_Guga
3328 cycles for 100 * dw2hex
3603 cycles for 100 * Hex$ Grincheux
42090 cycles for 100 * CRT sprintf
539 cycles for 100 * Bin2Hex
621 cycles for 100 * Bin2Hex2 cx
732 cycles for 100 * dwtoHex_Guga_SSE
814 cycles for 100 * dwtoHex_Guga
20 bytes for dw2hex
64 bytes for Hex$ Grincheux
29 bytes for CRT sprintf
138 bytes for Bin2Hex
150 bytes for Bin2Hex2 cx
245848 bytes for dwtoHex_Guga_SSE
76 bytes for dwtoHex_Guga
00345678 = eax dw2hex
12345678 = eax Hex$ Grincheux
345678 = eax CRT sprintf
12345678 = eax Bin2Hex
00345678 = eax Bin2Hex2 cx
12345678 = eax dwtoHex_Guga_SSE
12345678 = eax dwtoHex_Guga
It's difficult to beat Bin2Hex (https://masm32.com/board/index.php?msg=52677) ;-)
I have an excuse... my computer is old and tired, like me. :tongue:
Seeing your results guga, compared to mine... I feel a little less old, and a little less tired. ... at least partly... :biggrin:
Some fare better on your machine, others fare better on mine
Quote from: jj2007 on July 28, 2023, 11:24:33 AM
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
Hi JJ. Thanks...
Those results are a bit odd to me. It seems that on AMD dwtoHex_Guga_SSE is faster than Bin2Hex, and on Intel it's the opposite.
The difference between the two is not that big, but since this new version uses a huge table, I'll keep the good old Bin2Hex. The speed benefit (at least on my AMD) doesn't seem to compensate for the final size of the function: 245848 bytes (due to the table size).
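For a rough sense of that trade-off, here is a small C sketch that turns figures posted earlier in this thread into percentages (863 vs 655 cycles on the Ryzen 5 2400G, 736 vs 540 cycles on the i5-2450M, 245848 vs 138 bytes). The helper names are made up for illustration only.

```c
#include <assert.h>

/* Percentage by which `slower` exceeds `faster` (both in cycles). */
double percent_more(double slower, double faster)
{
    assert(faster > 0.0);
    return (slower - faster) / faster * 100.0;
}

/* How many times larger `big` is than `small` (both in bytes). */
double size_ratio(double big, double small)
{
    assert(small > 0.0);
    return big / small;
}
```

With those inputs, percent_more(863, 655) comes out near 31.8, percent_more(736, 540) near 36.3, and size_ratio(245848, 138) near 1781.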
OK, since we seem to be racing different binary-to-hex conversion routines, let me throw mine into the mix:
XlatTable DB "0123456789ABCDEF"
;====================================
; Bin2Hex()
;
; On entry,
; ECX--> buffer to write hex chars. to
; EDX = value to convert
;====================================
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
Bin2Hex PROC
        PUSH    EBX
        PUSH    EDI
        MOV     EBX, OFFSET XlatTable
        MOV     EDI, ECX
        MOV     CL, 24              ;# bits to shift right.
        MOV     CH, 4               ;# of bytes to convert.
hloop:  MOV     EAX, EDX            ;Original value.
        SHR     EAX, CL             ;Shift byte to right.
        MOV     AH, AL              ;Save byte.
        SHR     AL, 4               ;Get high nybble.
        XLATB
        STOSB
        MOV     AL, AH              ;Get back byte.
        AND     AL, 0FH             ;Get low nybble.
        XLATB
        STOSB
        SUB     CL, 8               ;Next byte over.
        DEC     CH
        JNZ     hloop
        MOV     BYTE PTR [EDI], 0   ;Terminate buffer.
        POP     EDI
        POP     EBX
        RET
Bin2Hex ENDP
Anyone care to test it?
I did not design this at all with speed in mind (because I just don't care), but I'm curious nonetheless. I did eliminate all memory references inside the loop except for XLATB and STOSB.
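For readers who want to check the expected output without assembling anything, the same idea can be modelled in C. This is only a sketch of the listing above, not a replacement for it: the function name bin2hex_nibbles and the table name kHexDigits are invented here, and the buffer is assumed to hold at least 9 bytes.

```c
#include <assert.h>
#include <stdint.h>

/* Same 16 characters as XlatTable in the listing above. */
static const char kHexDigits[] = "0123456789ABCDEF";

/* Writes 8 hex characters plus a terminating 0 into `out`.
   `out` must have room for at least 9 bytes. */
void bin2hex_nibbles(char *out, uint32_t value)
{
    assert(out != 0);
    for (int shift = 24; shift >= 0; shift -= 8) {
        uint8_t byte = (uint8_t)(value >> shift);  /* SHR EAX, CL        */
        *out++ = kHexDigits[byte >> 4];            /* high nybble lookup */
        *out++ = kHexDigits[byte & 0x0F];          /* low nybble lookup  */
    }
    *out = '\0';                                   /* terminate buffer   */
}
```

Each byte costs two table lookups here, which mirrors the two XLATB/STOSB pairs per loop iteration.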
My entry #2: longer table, shorter code:
XlatTable DB "000102030405060708090A0B0C0D0E0F"
DB "101112131415161718191A1B1C1D1E1F"
DB "202122232425262728292A2B2C2D2E2F"
DB "303132333435363738393A3B3C3D3E3F"
DB "404142434445464748494A4B4C4D4E4F"
DB "505152535455565758595A5B5C5D5E5F"
DB "606162636465666768696A6B6C6D6E6F"
DB "707172737475767778797A7B7C7D7E7F"
DB "808182838485868788898A8B8C8D8E8F"
DB "909192939495969798999A9B9C9D9E9F"
DB "A0A1A2A3A4A5A6A7A8A9AAABACADAEAF"
DB "B0B1B2B3B4B5B6B7B8B9BABBBCBDBEBF"
DB "C0C1C2C3C4C5C6C7C8C9CACBCCCDCECF"
DB "D0D1D2D3D4D5D6D7D8D9DADBDCDDDEDF"
DB "E0E1E2E3E4E5E6E7E8E9EAEBECEDEEEF"
DB "F0F1F2F3F4F5F6F7F8F9FAFBFCFDFEFF"
;====================================
; Bin2Hex()
;
; On entry,
; ECX--> buffer to write hex chars. to
; EDX = value to convert
;====================================
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
Bin2Hex PROC
        PUSH    EBX
        PUSH    EDI
        MOV     EBX, OFFSET XlatTable
        MOV     EDI, ECX
        MOV     CL, 24              ;# bits to shift right.
        MOV     CH, 4               ;# of bytes to convert.
hloop:  MOV     EAX, EDX            ;Original value.
        SHR     EAX, CL             ;Shift byte to right.
        AND     EAX, 0FFH           ;Isolate that byte.
        MOV     AX, [EBX + EAX*2]   ;Translate byte.
        STOSW
        SUB     CL, 8               ;Next byte over.
        DEC     CH
        JNZ     hloop
        MOV     BYTE PTR [EDI], 0   ;Terminate buffer.
        POP     EDI
        POP     EBX
        RET
Bin2Hex ENDP
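A 256-entry table typed by hand is easy to get wrong in one or two places, so it may be worth cross-checking it against one built by code. The C sketch below is one way to do that under stated assumptions: it fills a table with the same two-characters-per-byte layout at startup and then converts with one lookup per byte. The names hex_pairs, init_hex_pairs and bin2hex_pairs are invented for this example.

```c
#include <assert.h>
#include <stdint.h>

/* 256 entries of 2 characters each: 512 bytes, same layout as the DB table. */
static char hex_pairs[256][2];

/* Fill the table from code instead of typing the entries by hand. */
void init_hex_pairs(void)
{
    static const char digits[] = "0123456789ABCDEF";
    for (int i = 0; i < 256; ++i) {
        hex_pairs[i][0] = digits[i >> 4];    /* first character: high nybble */
        hex_pairs[i][1] = digits[i & 0x0F];  /* second character: low nybble */
    }
}

/* One table lookup per byte; `out` needs room for 9 bytes.
   init_hex_pairs() must have been called first. */
void bin2hex_pairs(char *out, uint32_t value)
{
    assert(out != 0);
    for (int shift = 24; shift >= 0; shift -= 8) {
        const char *pair = hex_pairs[(value >> shift) & 0xFF];
        *out++ = pair[0];
        *out++ = pair[1];
    }
    *out = '\0';
}
```

A test value such as 12345678h only touches four of the 256 entries, so a check that walks every byte value from 00h to FFh is a more convincing test of the table.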
Quote from: guga on July 28, 2023, 11:40:19 AM
Those results are a bit odd to me. It seems that on AMD dwtoHex_Guga_SSE is faster than Bin2Hex, and on Intel it's the opposite.
The difference between the two is not that big, but since this new version uses a huge table, I'll keep the good old Bin2Hex. The speed benefit (at least on my AMD) doesn't seem to compensate for the final size of the function: 245848 bytes (due to the table size).
Bin2Hex is pretty fast, indeed, and I've just spent hours trying to make it faster, without success.
Have you thought of gather instructions such as VGATHERDPS?
Btw if table size bothers you, look at crtHexTable3 in the source :cool:
New version with NoCforMe's #2:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
3341 cycles for 100 * dw2hex
1358 cycles for 100 * NoCforMe
540 cycles for 100 * Bin2Hex
739 cycles for 100 * dwtoHex_Guga_SSE
833 cycles for 100 * dwtoHex_Guga
3336 cycles for 100 * dw2hex
1270 cycles for 100 * NoCforMe
540 cycles for 100 * Bin2Hex
734 cycles for 100 * dwtoHex_Guga_SSE
789 cycles for 100 * dwtoHex_Guga
3336 cycles for 100 * dw2hex
1271 cycles for 100 * NoCforMe
540 cycles for 100 * Bin2Hex
732 cycles for 100 * dwtoHex_Guga_SSE
1413 cycles for 100 * dwtoHex_Guga
3363 cycles for 100 * dw2hex
1270 cycles for 100 * NoCforMe
540 cycles for 100 * Bin2Hex
731 cycles for 100 * dwtoHex_Guga_SSE
790 cycles for 100 * dwtoHex_Guga
20 bytes for dw2hex
572 bytes for NoCforMe
138 bytes for Bin2Hex
245848 bytes for dwtoHex_Guga_SSE
76 bytes for dwtoHex_Guga
12345678 = eax dw2hex
12345678 = eax NoCforMe
12345678 = eax Bin2Hex
12345678 = eax dwtoHex_Guga_SSE
12345678 = eax dwtoHex_Guga
Quote
Bin2Hex is pretty fast, indeed, and I've just spent hours trying to make it faster, without success.
Have you thought of gather instructions such as VGATHERDPS?
Btw if table size bothers you, look at crtHexTable3 in the source :cool:
Hi JJ
Yeah, on Intel Bin2Hex is pretty fast. I'll stick with it. In fact, it seems unnecessary for me to reinvent the wheel, because my variation is only faster on AMD and uses a huge table for only 20-30% better performance. I could, however, make a variation of such functions that checks the processor, but that would definitely kill the general performance anyway.
About VGATHERDPS, I haven't implemented it yet. Is it for x86 (32 bits)? Do you have an example using such instructions? I'll try to port it later when I have more free time.
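On the idea of checking the processor: one common way to limit the cost is to run the check once at startup and call through a function pointer afterwards, so each call only pays for an indirect call. The C sketch below illustrates that pattern under assumptions. The routine names are hypothetical placeholders, the AVX2 test merely stands in for whatever processor check is wanted, and `__builtin_cpu_supports` is a GCC/Clang builtin (other compilers fall back to the compact routine here). Whether the indirect call is cheap enough for a routine this small would still need measuring.

```c
#include <assert.h>
#include <stdint.h>

typedef void (*hex_fn)(char *out, uint32_t value);

static const char kDigits[] = "0123456789ABCDEF";

/* Placeholder for the compact routine (16-entry table). */
void hex_compact(char *out, uint32_t value)
{
    assert(out != 0);
    for (int shift = 28; shift >= 0; shift -= 4)
        *out++ = kDigits[(value >> shift) & 0x0F];
    *out = '\0';
}

/* Placeholder for a CPU-specific routine; in this sketch it only differs
   by computing the digits arithmetically instead of by lookup. */
void hex_alternate(char *out, uint32_t value)
{
    assert(out != 0);
    for (int shift = 28; shift >= 0; shift -= 4) {
        unsigned d = (value >> shift) & 0x0F;
        *out++ = (char)(d < 10 ? '0' + d : 'A' + (d - 10));
    }
    *out = '\0';
}

/* Returns 1 if the CPU reports AVX2, 0 otherwise or when it cannot be told. */
int cpu_has_avx2(void)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx2") ? 1 : 0;
#else
    return 0;
#endif
}

/* Decide once; every later call goes through the returned pointer. */
hex_fn select_hex_routine(void)
{
    return cpu_has_avx2() ? hex_alternate : hex_compact;
}
```

A caller would store the result of select_hex_routine() in a global at program start and use that pointer for all conversions.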
There is indeed quite a difference between Intel and AMD:
AMD Athlon Gold 3150U with Radeon Graphics (SSE4)
5494 cycles for 100 * dw2hex
1381 cycles for 100 * NoCforMe
675 cycles for 100 * Bin2Hex
496 cycles for 100 * dwtoHex_Guga_SSE
562 cycles for 100 * dwtoHex_Guga
5558 cycles for 100 * dw2hex
1425 cycles for 100 * NoCforMe
680 cycles for 100 * Bin2Hex
515 cycles for 100 * dwtoHex_Guga_SSE
564 cycles for 100 * dwtoHex_Guga
5536 cycles for 100 * dw2hex
1469 cycles for 100 * NoCforMe
676 cycles for 100 * Bin2Hex
466 cycles for 100 * dwtoHex_Guga_SSE
567 cycles for 100 * dwtoHex_Guga
5520 cycles for 100 * dw2hex
1438 cycles for 100 * NoCforMe
692 cycles for 100 * Bin2Hex
534 cycles for 100 * dwtoHex_Guga_SSE
579 cycles for 100 * dwtoHex_Guga
20 bytes for dw2hex
572 bytes for NoCforMe
138 bytes for Bin2Hex
245848 bytes for dwtoHex_Guga_SSE
76 bytes for dwtoHex_Guga
12345678 = eax dw2hex
12345678 = eax NoCforMe
12345678 = eax Bin2Hex
12345678 = eax dwtoHex_Guga_SSE
12345678 = eax dwtoHex_Guga
Quote from: guga on July 28, 2023, 10:52:54 PM
About VGATHERDPS, I haven't implemented it yet. Is it for x86 (32 bits)? Do you have an example using such instructions? I'll try to port it later when I have more free time.
No, I never tested it. The documentation is lousy :sad:
Another test result.
AMD Ryzen 7 3700X 8-Core Processor (SSE4)
5736 cycles for 100 * dw2hex
2189 cycles for 100 * Hex$ Grincheux
29319 cycles for 100 * CRT sprintf
360 cycles for 100 * Bin2Hex
397 cycles for 100 * Bin2Hex2 cx
499 cycles for 100 * dwtoHex_Guga_SSE
523 cycles for 100 * dwtoHex_Guga
5731 cycles for 100 * dw2hex
2479 cycles for 100 * Hex$ Grincheux
29270 cycles for 100 * CRT sprintf
363 cycles for 100 * Bin2Hex
383 cycles for 100 * Bin2Hex2 cx
497 cycles for 100 * dwtoHex_Guga_SSE
525 cycles for 100 * dwtoHex_Guga
5741 cycles for 100 * dw2hex
2483 cycles for 100 * Hex$ Grincheux
29275 cycles for 100 * CRT sprintf
365 cycles for 100 * Bin2Hex
385 cycles for 100 * Bin2Hex2 cx
494 cycles for 100 * dwtoHex_Guga_SSE
524 cycles for 100 * dwtoHex_Guga
5732 cycles for 100 * dw2hex
2245 cycles for 100 * Hex$ Grincheux
29267 cycles for 100 * CRT sprintf
362 cycles for 100 * Bin2Hex
385 cycles for 100 * Bin2Hex2 cx
496 cycles for 100 * dwtoHex_Guga_SSE
523 cycles for 100 * dwtoHex_Guga
20 bytes for dw2hex
64 bytes for Hex$ Grincheux
29 bytes for CRT sprintf
138 bytes for Bin2Hex
150 bytes for Bin2Hex2 cx
245848 bytes for dwtoHex_Guga_SSE
76 bytes for dwtoHex_Guga
00345678 = eax dw2hex
12345678 = eax Hex$ Grincheux
345678 = eax CRT sprintf
12345678 = eax Bin2Hex
00345678 = eax Bin2Hex2 cx
12345678 = eax dwtoHex_Guga_SSE
12345678 = eax dwtoHex_Guga
--- ok ---
Thanks Greenhorn
That's a bit weird. On your AMD processor, my version produces the same results as on Intel. But on my processor, the opposite seems to happen. I wonder what is causing such different results.
AMD Ryzen 5 2400G with Radeon Vega Graphics (SSE4)
9124 cycles for 100 * dw2hex
2980 cycles for 100 * Hex$ Grincheux
51106 cycles for 100 * CRT sprintf
864 cycles for 100 * Bin2Hex
807 cycles for 100 * Bin2Hex2 cx
623 cycles for 100 * dwtoHex_Guga_SSE
713 cycles for 100 * dwtoHex_Guga
7591 cycles for 100 * dw2hex
4216 cycles for 100 * Hex$ Grincheux
47734 cycles for 100 * CRT sprintf
899 cycles for 100 * Bin2Hex
814 cycles for 100 * Bin2Hex2 cx
657 cycles for 100 * dwtoHex_Guga_SSE
716 cycles for 100 * dwtoHex_Guga
7496 cycles for 100 * dw2hex
5396 cycles for 100 * Hex$ Grincheux
55117 cycles for 100 * CRT sprintf
936 cycles for 100 * Bin2Hex
1289 cycles for 100 * Bin2Hex2 cx
858 cycles for 100 * dwtoHex_Guga_SSE
708 cycles for 100 * dwtoHex_Guga
8734 cycles for 100 * dw2hex
2919 cycles for 100 * Hex$ Grincheux
46808 cycles for 100 * CRT sprintf
900 cycles for 100 * Bin2Hex
807 cycles for 100 * Bin2Hex2 cx
860 cycles for 100 * dwtoHex_Guga_SSE
1015 cycles for 100 * dwtoHex_Guga
20 bytes for dw2hex
64 bytes for Hex$ Grincheux
29 bytes for CRT sprintf
138 bytes for Bin2Hex
150 bytes for Bin2Hex2 cx
245848 bytes for dwtoHex_Guga_SSE
76 bytes for dwtoHex_Guga
00345678 = eax dw2hex
12345678 = eax Hex$ Grincheux
345678 = eax CRT sprintf
12345678 = eax Bin2Hex
00345678 = eax Bin2Hex2 cx
12345678 = eax dwtoHex_Guga_SSE
12345678 = eax dwtoHex_Guga
--- ok ---
Quote from: jj2007 on July 28, 2023, 11:09:15 PM
No, I never tested it. The documentation is lousy :sad:
Hi JJ
I'll try to find some more info to see if it can be implemented on x86. I found a little information (for fasm as well) that might give an idea of where to start later:
https://board.flatassembler.net/topic.php?t=22502
https://www.felixcloutier.com/x86/vgatherdps:vgatherqps
https://stackoverflow.com/questions/21774454/how-are-the-gather-instructions-in-avx2-implemented
With jj's testbed:
Intel(R) Core(TM)2 Duo CPU E8400 @ 3.00GHz (SSE4)
3899 cycles for 100 * dw2hex
1692 cycles for 100 * NoCforMe
697 cycles for 100 * Bin2Hex
1224 cycles for 100 * dwtoHex_Guga_SSE
1374 cycles for 100 * dwtoHex_Guga
4106 cycles for 100 * dw2hex
1692 cycles for 100 * NoCforMe
701 cycles for 100 * Bin2Hex
1228 cycles for 100 * dwtoHex_Guga_SSE
1360 cycles for 100 * dwtoHex_Guga
4142 cycles for 100 * dw2hex
1689 cycles for 100 * NoCforMe
690 cycles for 100 * Bin2Hex
1220 cycles for 100 * dwtoHex_Guga_SSE
1357 cycles for 100 * dwtoHex_Guga
4129 cycles for 100 * dw2hex
1692 cycles for 100 * NoCforMe
690 cycles for 100 * Bin2Hex
1225 cycles for 100 * dwtoHex_Guga_SSE
1356 cycles for 100 * dwtoHex_Guga
20 bytes for dw2hex
572 bytes for NoCforMe
138 bytes for Bin2Hex
245848 bytes for dwtoHex_Guga_SSE
76 bytes for dwtoHex_Guga
12345678 = eax dw2hex
12345678 = eax NoCforMe
12345678 = eax Bin2Hex
12345678 = eax dwtoHex_Guga_SSE
12345678 = eax dwtoHex_Guga
--- ok ---
Quote from: guga on July 29, 2023, 03:07:02 AM
Thanks Greenhorn
That's a bit weird. On your AMD processor, my version produces the same results as on Intel. But on my processor, the opposite seems to happen. I wonder what is causing such different results.
I guess it's due to optimization of the Ryzen architecture with each generation.
It would be interesting to see how Ryzen 5000 and 7000 perform.
Bin2HexTimingsNew2:
AMD Ryzen 9 5950X 16-Core Processor (SSE4)
4671 cycles for 100 * dw2hex
1920 cycles for 100 * Hex$ Grincheux
22845 cycles for 100 * CRT sprintf
361 cycles for 100 * Bin2Hex
368 cycles for 100 * Bin2Hex2 cx
465 cycles for 100 * dwtoHex_Guga_SSE
475 cycles for 100 * dwtoHex_Guga
4671 cycles for 100 * dw2hex
1917 cycles for 100 * Hex$ Grincheux
23276 cycles for 100 * CRT sprintf
364 cycles for 100 * Bin2Hex
368 cycles for 100 * Bin2Hex2 cx
480 cycles for 100 * dwtoHex_Guga_SSE
454 cycles for 100 * dwtoHex_Guga
4698 cycles for 100 * dw2hex
1923 cycles for 100 * Hex$ Grincheux
22942 cycles for 100 * CRT sprintf
362 cycles for 100 * Bin2Hex
357 cycles for 100 * Bin2Hex2 cx
431 cycles for 100 * dwtoHex_Guga_SSE
438 cycles for 100 * dwtoHex_Guga
4682 cycles for 100 * dw2hex
1933 cycles for 100 * Hex$ Grincheux
23104 cycles for 100 * CRT sprintf
362 cycles for 100 * Bin2Hex
367 cycles for 100 * Bin2Hex2 cx
430 cycles for 100 * dwtoHex_Guga_SSE
439 cycles for 100 * dwtoHex_Guga
20 bytes for dw2hex
64 bytes for Hex$ Grincheux
29 bytes for CRT sprintf
138 bytes for Bin2Hex
150 bytes for Bin2Hex2 cx
245848 bytes for dwtoHex_Guga_SSE
76 bytes for dwtoHex_Guga
00345678 = eax dw2hex
12345678 = eax Hex$ Grincheux
345678 = eax CRT sprintf
12345678 = eax Bin2Hex
00345678 = eax Bin2Hex2 cx
12345678 = eax dwtoHex_Guga_SSE
12345678 = eax dwtoHex_Guga
Bin2HexTimingsNew2JJ:
AMD Ryzen 9 5950X 16-Core Processor (SSE4)
4650 cycles for 100 * dw2hex
1359 cycles for 100 * NoCforMe
355 cycles for 100 * Bin2Hex
445 cycles for 100 * dwtoHex_Guga_SSE
451 cycles for 100 * dwtoHex_Guga
4657 cycles for 100 * dw2hex
1385 cycles for 100 * NoCforMe
355 cycles for 100 * Bin2Hex
485 cycles for 100 * dwtoHex_Guga_SSE
473 cycles for 100 * dwtoHex_Guga
4659 cycles for 100 * dw2hex
1375 cycles for 100 * NoCforMe
355 cycles for 100 * Bin2Hex
434 cycles for 100 * dwtoHex_Guga_SSE
436 cycles for 100 * dwtoHex_Guga
4660 cycles for 100 * dw2hex
1376 cycles for 100 * NoCforMe
355 cycles for 100 * Bin2Hex
466 cycles for 100 * dwtoHex_Guga_SSE
437 cycles for 100 * dwtoHex_Guga
20 bytes for dw2hex
572 bytes for NoCforMe
138 bytes for Bin2Hex
245848 bytes for dwtoHex_Guga_SSE
76 bytes for dwtoHex_Guga
12345678 = eax dw2hex
12345678 = eax NoCforMe
12345678 = eax Bin2Hex
12345678 = eax dwtoHex_Guga_SSE
12345678 = eax dwtoHex_Guga
Quote from: guga on July 29, 2023, 03:09:31 AM
Quote from: jj2007 on July 28, 2023, 11:09:15 PM
No, I never tested it. The documentation is lousy :sad:
Hi JJ
I'll try to find some more info to see if it can be implemented on x86. I found a little information (for fasm as well) that might give an idea of where to start later:
https://board.flatassembler.net/topic.php?t=22502
https://www.felixcloutier.com/x86/vgatherdps:vgatherqps
https://stackoverflow.com/questions/21774454/how-are-the-gather-instructions-in-avx2-implemented
At least, now I got the syntax right. This assembles with UAsm64:
mov eax, offset sometext
int 3
nop
vpgatherdd xmm0, [eax+xmm1], xmm2
nop
movd edx, xmm0
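Since the documentation is hard to read, a scalar C model of what a four-lane dword gather does may help. This is a sketch of the instruction's documented behaviour as I understand it, not code taken from the testbed: each lane whose mask element has its top bit set loads a dword from base plus that lane's index (scale 1, as in the line above), other lanes keep their old destination value, and the mask ends up cleared. The function name gather_dd_model is invented.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of a 4-lane dword gather with scale 1.
   dst   : destination lanes (xmm0 in the snippet above)
   base  : base address      (eax)
   index : per-lane byte offsets (xmm1)
   mask  : per-lane mask; only the top bit of each element matters (xmm2) */
void gather_dd_model(uint32_t dst[4], const uint8_t *base,
                     const int32_t index[4], uint32_t mask[4])
{
    assert(base != 0);
    for (int lane = 0; lane < 4; ++lane) {
        if (mask[lane] & 0x80000000u) {
            /* memcpy keeps the load valid for unaligned addresses */
            memcpy(&dst[lane], base + index[lane], sizeof dst[lane]);
        }
        mask[lane] = 0;  /* the instruction leaves the mask register zeroed */
    }
}
```

The cleared mask is easy to overlook: it has to be reloaded before every gather, otherwise the second gather loads nothing.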
AMD Ryzen 5 3400G with Radeon Vega Graphics (SSE4)
6854 cycles for 100 * dw2hex
2959 cycles for 100 * Hex$ Grincheux
44193 cycles for 100 * CRT sprintf
911 cycles for 100 * Bin2Hex
780 cycles for 100 * Bin2Hex2 cx
620 cycles for 100 * dwtoHex_Guga_SSE
691 cycles for 100 * dwtoHex_Guga
6836 cycles for 100 * dw2hex
2805 cycles for 100 * Hex$ Grincheux
43706 cycles for 100 * CRT sprintf
834 cycles for 100 * Bin2Hex
778 cycles for 100 * Bin2Hex2 cx
621 cycles for 100 * dwtoHex_Guga_SSE
692 cycles for 100 * dwtoHex_Guga
7309 cycles for 100 * dw2hex
2890 cycles for 100 * Hex$ Grincheux
45566 cycles for 100 * CRT sprintf
837 cycles for 100 * Bin2Hex
849 cycles for 100 * Bin2Hex2 cx
625 cycles for 100 * dwtoHex_Guga_SSE
694 cycles for 100 * dwtoHex_Guga
7069 cycles for 100 * dw2hex
2887 cycles for 100 * Hex$ Grincheux
45794 cycles for 100 * CRT sprintf
835 cycles for 100 * Bin2Hex
923 cycles for 100 * Bin2Hex2 cx
626 cycles for 100 * dwtoHex_Guga_SSE
689 cycles for 100 * dwtoHex_Guga
20 bytes for dw2hex
64 bytes for Hex$ Grincheux
29 bytes for CRT sprintf
138 bytes for Bin2Hex
150 bytes for Bin2Hex2 cx
245848 bytes for dwtoHex_Guga_SSE
76 bytes for dwtoHex_Guga
00345678 = eax dw2hex
12345678 = eax Hex$ Grincheux
345678 = eax CRT sprintf
12345678 = eax Bin2Hex
00345678 = eax Bin2Hex2 cx
12345678 = eax dwtoHex_Guga_SSE
12345678 = eax dwtoHex_Guga
One more, please...
There is one Avx2 algo. If you run it on an older cpu, it will simply output "No Avx2" (I hope that works...)
AMD Athlon Gold 3150U with Radeon Graphics (SSE4)
5509 cycles for 100 * dw2hex
1372 cycles for 100 * NoCforMe
662 cycles for 100 * Bin2Hex
1082 cycles for 100 * Bin2Hex Avx2
653 cycles for 100 * dwtoHex_Guga_SSE
546 cycles for 100 * dwtoHex_Guga
5480 cycles for 100 * dw2hex
1373 cycles for 100 * NoCforMe
676 cycles for 100 * Bin2Hex
1074 cycles for 100 * Bin2Hex Avx2
495 cycles for 100 * dwtoHex_Guga_SSE
546 cycles for 100 * dwtoHex_Guga
5481 cycles for 100 * dw2hex
1381 cycles for 100 * NoCforMe
702 cycles for 100 * Bin2Hex
1085 cycles for 100 * Bin2Hex Avx2
495 cycles for 100 * dwtoHex_Guga_SSE
546 cycles for 100 * dwtoHex_Guga
5436 cycles for 100 * dw2hex
1520 cycles for 100 * NoCforMe
660 cycles for 100 * Bin2Hex
1074 cycles for 100 * Bin2Hex Avx2
495 cycles for 100 * dwtoHex_Guga_SSE
573 cycles for 100 * dwtoHex_Guga
20 bytes for dw2hex
572 bytes for NoCforMe
138 bytes for Bin2Hex
198 bytes for Bin2Hex Avx2
245848 bytes for dwtoHex_Guga_SSE
76 bytes for dwtoHex_Guga
12345678 = eax dw2hex
12345678 = eax NoCforMe
12345678 = eax Bin2Hex
12345678 = eax Bin2Hex Avx2
12345678 = eax dwtoHex_Guga_SSE
12345678 = eax dwtoHex_Guga
Bin2HexTimingsNewG:
Intel(R) Core(TM)2 Duo CPU E8400 @ 3.00GHz (SSE4)
4111 cycles for 100 * dw2hex
1678 cycles for 100 * NoCforMe
1236 cycles for 100 * Bin2Hex
23921 cycles for 100 * Bin2Hex Avx2
1243 cycles for 100 * dwtoHex_Guga_SSE
1360 cycles for 100 * dwtoHex_Guga
4067 cycles for 100 * dw2hex
1678 cycles for 100 * NoCforMe
1242 cycles for 100 * Bin2Hex
23994 cycles for 100 * Bin2Hex Avx2
1294 cycles for 100 * dwtoHex_Guga_SSE
1334 cycles for 100 * dwtoHex_Guga
4079 cycles for 100 * dw2hex
1696 cycles for 100 * NoCforMe
1131 cycles for 100 * Bin2Hex
23931 cycles for 100 * Bin2Hex Avx2
1243 cycles for 100 * dwtoHex_Guga_SSE
1342 cycles for 100 * dwtoHex_Guga
4070 cycles for 100 * dw2hex
1678 cycles for 100 * NoCforMe
1236 cycles for 100 * Bin2Hex
23971 cycles for 100 * Bin2Hex Avx2
1239 cycles for 100 * dwtoHex_Guga_SSE
1364 cycles for 100 * dwtoHex_Guga
20 bytes for dw2hex
572 bytes for NoCforMe
138 bytes for Bin2Hex
198 bytes for Bin2Hex Avx2
245848 bytes for dwtoHex_Guga_SSE
76 bytes for dwtoHex_Guga
12345678 = eax dw2hex
12345678 = eax NoCforMe
12345678 = eax Bin2Hex
No Avx2 = eax Bin2Hex Avx2
12345678 = eax dwtoHex_Guga_SSE
12345678 = eax dwtoHex_Guga
--- ok ---
No Avx2 = eax Bin2Hex Avx2 :tongue:
Bin2HexTimingsNewG:
AMD Ryzen 9 5950X 16-Core Processor (SSE4)
4677 cycles for 100 * dw2hex
1368 cycles for 100 * NoCforMe
356 cycles for 100 * Bin2Hex
18422 cycles for 100 * Bin2Hex Avx2
442 cycles for 100 * dwtoHex_Guga_SSE
436 cycles for 100 * dwtoHex_Guga
4670 cycles for 100 * dw2hex
1374 cycles for 100 * NoCforMe
361 cycles for 100 * Bin2Hex
18377 cycles for 100 * Bin2Hex Avx2
428 cycles for 100 * dwtoHex_Guga_SSE
442 cycles for 100 * dwtoHex_Guga
4703 cycles for 100 * dw2hex
1370 cycles for 100 * NoCforMe
355 cycles for 100 * Bin2Hex
18341 cycles for 100 * Bin2Hex Avx2
432 cycles for 100 * dwtoHex_Guga_SSE
436 cycles for 100 * dwtoHex_Guga
4731 cycles for 100 * dw2hex
1373 cycles for 100 * NoCforMe
390 cycles for 100 * Bin2Hex
18299 cycles for 100 * Bin2Hex Avx2
464 cycles for 100 * dwtoHex_Guga_SSE
439 cycles for 100 * dwtoHex_Guga
20 bytes for dw2hex
572 bytes for NoCforMe
138 bytes for Bin2Hex
198 bytes for Bin2Hex Avx2
245848 bytes for dwtoHex_Guga_SSE
76 bytes for dwtoHex_Guga
12345678 = eax dw2hex
12345678 = eax NoCforMe
12345678 = eax Bin2Hex
12345678 = eax Bin2Hex Avx2
12345678 = eax dwtoHex_Guga_SSE
12345678 = eax dwtoHex_Guga
Ok, there was a little glitch. Corrected version attached, it even builds without MasmBasic.
As you can see, the Avx2 version is not particularly fast. However, it was fun to code it, and I almost understood one of the exotic gather instructions ;-)
AMD Athlon Gold 3150U with Radeon Graphics (SSE4)
5513 cycles for 100 * dw2hex
1383 cycles for 100 * NoCforMe
674 cycles for 100 * Bin2Hex
1077 cycles for 100 * Bin2Hex Avx2
493 cycles for 100 * dwtoHex_Guga_SSE
565 cycles for 100 * dwtoHex_Guga
5470 cycles for 100 * dw2hex
1377 cycles for 100 * NoCforMe
672 cycles for 100 * Bin2Hex
1075 cycles for 100 * Bin2Hex Avx2
501 cycles for 100 * dwtoHex_Guga_SSE
549 cycles for 100 * dwtoHex_Guga
5439 cycles for 100 * dw2hex
1375 cycles for 100 * NoCforMe
668 cycles for 100 * Bin2Hex
1077 cycles for 100 * Bin2Hex Avx2
490 cycles for 100 * dwtoHex_Guga_SSE
551 cycles for 100 * dwtoHex_Guga
5543 cycles for 100 * dw2hex
1377 cycles for 100 * NoCforMe
716 cycles for 100 * Bin2Hex
1088 cycles for 100 * Bin2Hex Avx2
491 cycles for 100 * dwtoHex_Guga_SSE
555 cycles for 100 * dwtoHex_Guga
20 bytes for dw2hex
572 bytes for NoCforMe
138 bytes for Bin2Hex
222 bytes for Bin2Hex Avx2
245848 bytes for dwtoHex_Guga_SSE
76 bytes for dwtoHex_Guga
12345678 = eax dw2hex
12345678 = eax NoCforMe
12345678 = eax Bin2Hex
12345678 = eax Bin2Hex Avx2
12345678 = eax dwtoHex_Guga_SSE
12345678 = eax dwtoHex_Guga
AMD Ryzen 9 5950X 16-Core Processor (SSE4)
4713 cycles for 100 * dw2hex
1381 cycles for 100 * NoCforMe
358 cycles for 100 * Bin2Hex
18324 cycles for 100 * Bin2Hex Avx2
433 cycles for 100 * dwtoHex_Guga_SSE
452 cycles for 100 * dwtoHex_Guga
4699 cycles for 100 * dw2hex
1376 cycles for 100 * NoCforMe
357 cycles for 100 * Bin2Hex
18193 cycles for 100 * Bin2Hex Avx2
432 cycles for 100 * dwtoHex_Guga_SSE
438 cycles for 100 * dwtoHex_Guga
4716 cycles for 100 * dw2hex
1380 cycles for 100 * NoCforMe
357 cycles for 100 * Bin2Hex
18365 cycles for 100 * Bin2Hex Avx2
430 cycles for 100 * dwtoHex_Guga_SSE
438 cycles for 100 * dwtoHex_Guga
4695 cycles for 100 * dw2hex
1376 cycles for 100 * NoCforMe
357 cycles for 100 * Bin2Hex
18265 cycles for 100 * Bin2Hex Avx2
428 cycles for 100 * dwtoHex_Guga_SSE
450 cycles for 100 * dwtoHex_Guga
20 bytes for dw2hex
572 bytes for NoCforMe
138 bytes for Bin2Hex
222 bytes for Bin2Hex Avx2
245848 bytes for dwtoHex_Guga_SSE
76 bytes for dwtoHex_Guga
12345678 = eax dw2hex
12345678 = eax NoCforMe
12345678 = eax Bin2Hex
12345678 = eax Bin2Hex Avx2
12345678 = eax dwtoHex_Guga_SSE
12345678 = eax dwtoHex_Guga
Wow, this is weird - and it's not an old CPU (https://www.cpubenchmark.net/cpu.php?cpu=AMD+Ryzen+9+5950X&id=3862) :rolleyes:
358 cycles for 100 * Bin2Hex
18324 cycles for 100 * Bin2Hex Avx2
Bin2HexTimingsM32
Intel(R) Core(TM)2 Duo CPU E8400 @ 3.00GHz (SSE4)
4101 cycles for 100 * dw2hex
1689 cycles for 100 * NoCforMe
698 cycles for 100 * Bin2Hex
1187 cycles for 100 * Bin2Hex Avx2
1223 cycles for 100 * dwtoHex_Guga_SSE
1357 cycles for 100 * dwtoHex_Guga
4099 cycles for 100 * dw2hex
1689 cycles for 100 * NoCforMe
691 cycles for 100 * Bin2Hex
1187 cycles for 100 * Bin2Hex Avx2
1218 cycles for 100 * dwtoHex_Guga_SSE
1356 cycles for 100 * dwtoHex_Guga
4101 cycles for 100 * dw2hex
1689 cycles for 100 * NoCforMe
782 cycles for 100 * Bin2Hex
1191 cycles for 100 * Bin2Hex Avx2
1219 cycles for 100 * dwtoHex_Guga_SSE
1358 cycles for 100 * dwtoHex_Guga
4093 cycles for 100 * dw2hex
1701 cycles for 100 * NoCforMe
714 cycles for 100 * Bin2Hex
1187 cycles for 100 * Bin2Hex Avx2
1218 cycles for 100 * dwtoHex_Guga_SSE
1351 cycles for 100 * dwtoHex_Guga
20 bytes for dw2hex
572 bytes for NoCforMe
138 bytes for Bin2Hex
222 bytes for Bin2Hex Avx2
245848 bytes for dwtoHex_Guga_SSE
76 bytes for dwtoHex_Guga
12345678 = eax dw2hex
12345678 = eax NoCforMe
12345678 = eax Bin2Hex
No Avx2! = eax Bin2Hex Avx2
12345678 = eax dwtoHex_Guga_SSE
12345678 = eax dwtoHex_Guga
--- ok ---
Quote from: jj2007 on July 30, 2023, 08:10:32 AM
Wow, this is weird - and it's not an old CPU (https://www.cpubenchmark.net/cpu.php?cpu=AMD+Ryzen+9+5950X&id=3862) :rolleyes:
358 cycles for 100 * Bin2Hex
18324 cycles for 100 * Bin2Hex Avx2
Maybe the AVX2 unit is only powered up on demand.
AMD Ryzen 5 3400G with Radeon Vega Graphics (SSE4)
6837 cycles for 100 * dw2hex
1796 cycles for 100 * NoCforMe
811 cycles for 100 * Bin2Hex
1406 cycles for 100 * Bin2Hex Avx2
602 cycles for 100 * dwtoHex_Guga_SSE
675 cycles for 100 * dwtoHex_Guga
6862 cycles for 100 * dw2hex
1729 cycles for 100 * NoCforMe
813 cycles for 100 * Bin2Hex
1406 cycles for 100 * Bin2Hex Avx2
597 cycles for 100 * dwtoHex_Guga_SSE
740 cycles for 100 * dwtoHex_Guga
6835 cycles for 100 * dw2hex
1709 cycles for 100 * NoCforMe
885 cycles for 100 * Bin2Hex
1419 cycles for 100 * Bin2Hex Avx2
604 cycles for 100 * dwtoHex_Guga_SSE
706 cycles for 100 * dwtoHex_Guga
6841 cycles for 100 * dw2hex
1792 cycles for 100 * NoCforMe
853 cycles for 100 * Bin2Hex
1323 cycles for 100 * Bin2Hex Avx2
608 cycles for 100 * dwtoHex_Guga_SSE
669 cycles for 100 * dwtoHex_Guga
20 bytes for dw2hex
572 bytes for NoCforMe
138 bytes for Bin2Hex
222 bytes for Bin2Hex Avx2
245848 bytes for dwtoHex_Guga_SSE
76 bytes for dwtoHex_Guga
12345678 = eax dw2hex
12345678 = eax NoCforMe
12345678 = eax Bin2Hex
12345678 = eax Bin2Hex Avx2
12345678 = eax dwtoHex_Guga_SSE
12345678 = eax dwtoHex_Guga
Quote from: TimoVJL on July 30, 2023, 03:05:17 PM
Quote from: jj2007 on July 30, 2023, 08:10:32 AM
Wow, this is weird - and it's not an old CPU (https://www.cpubenchmark.net/cpu.php?cpu=AMD+Ryzen+9+5950X&id=3862) :rolleyes:
358 cycles for 100 * Bin2Hex
18324 cycles for 100 * Bin2Hex Avx2
Maybe the AVX2 unit is only powered up on demand.
I'm wondering how the program is calculating cycles on my machine, since I apparently have no AVX2. Looking at the cycle counts, though, my computer is doing something to get those results... or is that just from running the 'overhead' code? guga? jj2007?
Quote: I'm wondering how the program is calculating cycles on my machine, since I apparently have no AVX2. Looking at the cycle counts, though, my computer is doing something to get those results... or is that just from running the 'overhead' code? guga? jj2007?
Have a look at the source ;-)
; fall through
MbHxA2 proc
mov edx, offset MbHexTable ; edx
cmp byte ptr [edx], "0"
jne MbHxCT ; create the table
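The excerpt above appears to test whether the first byte of MbHexTable is already "0" and to jump to a table-building routine if it is not. Here is a minimal sketch of that build-on-first-use idea, written in C only so it compiles anywhere. The names `hex_table`, `build_hex_table` and `dword_to_hex` are hypothetical, and the 256-entry two-character layout and the uppercase digits are assumptions, not taken from the MASM source:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed layout: 256 entries of two ASCII characters, "00" .. "FF".
   Static storage is zero-initialised, so hex_table[0][0] is not '0'
   until the table has been built. */
static char hex_table[256][2];

/* Fill the table once with the two hex digits of every byte value. */
static void build_hex_table(void)
{
    static const char digits[] = "0123456789ABCDEF";
    for (int i = 0; i < 256; i++) {
        hex_table[i][0] = digits[i >> 4];   /* high nibble */
        hex_table[i][1] = digits[i & 15];   /* low nibble  */
    }
}

/* Write the 8 hex digits of value plus a terminating NUL into out,
   which must have room for at least 9 bytes. */
void dword_to_hex(uint32_t value, char *out)
{
    assert(out != NULL);
    if (hex_table[0][0] != '0')             /* table not created yet */
        build_hex_table();
    for (int i = 0; i < 4; i++) {
        unsigned byte = (value >> (24 - 8 * i)) & 0xFF;  /* most significant byte first */
        memcpy(out + 2 * i, hex_table[byte], 2);
    }
    out[8] = '\0';
}
```

Calling `dword_to_hex(0x12345678, buf)` with a 9-byte `buf` fills it with the same "12345678" string the timing output prints. Only the first call pays for building the table, and the sketch is not thread-safe.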
Quote from: TimoVJL on July 31, 2023, 05:18:29 PM
AMD Ryzen 5 3400G with Radeon Vega Graphics (SSE4)
...
1406 cycles for 100 * Bin2Hex Avx2
...
How did you achieve this result, Timo?
Maybe it's time to close this thread - over 100x faster than the CRT (>170x on AMD) should be enough, right?
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
3363 cycles for 100 * dw2hex (Masm32 SDK)
1255 cycles for 100 * NoCforMe
52270 cycles for 100 * CRT sprintf
525 cycles for 100 * Bin2Hex
1900 cycles for 100 * Bin2Hex Avx2
443 cycles for 100 * Bin2Hex JJ
782 cycles for 100 * dwtoHex_Guga_SSE
818 cycles for 100 * dwtoHex_Guga
3339 cycles for 100 * dw2hex (Masm32 SDK)
1255 cycles for 100 * NoCforMe
52685 cycles for 100 * CRT sprintf
525 cycles for 100 * Bin2Hex
1900 cycles for 100 * Bin2Hex Avx2
443 cycles for 100 * Bin2Hex JJ
780 cycles for 100 * dwtoHex_Guga_SSE
817 cycles for 100 * dwtoHex_Guga
3331 cycles for 100 * dw2hex (Masm32 SDK)
1254 cycles for 100 * NoCforMe
52593 cycles for 100 * CRT sprintf
525 cycles for 100 * Bin2Hex
1899 cycles for 100 * Bin2Hex Avx2
443 cycles for 100 * Bin2Hex JJ
781 cycles for 100 * dwtoHex_Guga_SSE
816 cycles for 100 * dwtoHex_Guga
3332 cycles for 100 * dw2hex (Masm32 SDK)
1255 cycles for 100 * NoCforMe
52551 cycles for 100 * CRT sprintf
525 cycles for 100 * Bin2Hex
1899 cycles for 100 * Bin2Hex Avx2
443 cycles for 100 * Bin2Hex JJ
772 cycles for 100 * dwtoHex_Guga_SSE
818 cycles for 100 * dwtoHex_Guga
20 bytes for dw2hex (Masm32 SDK)
572 bytes for NoCforMe
29 bytes for CRT sprintf
130 bytes for Bin2Hex
182 bytes for Bin2Hex Avx2
168 bytes for Bin2Hex JJ
245848 bytes for dwtoHex_Guga_SSE
76 bytes for dwtoHex_Guga
12345678 = eax dw2hex (Masm32 SDK)
12345678 = eax NoCforMe
12345678 = eax CRT sprintf
12345678 = eax Bin2Hex
No Avx2! = eax Bin2Hex Avx2
12345678 = eax Bin2Hex JJ
12345678 = eax dwtoHex_Guga_SSE
12345678 = eax dwtoHex_Guga
AMD Athlon Gold 3150U with Radeon Graphics (SSE4)
5536 cycles for 100 * dw2hex (Masm32 SDK)
1390 cycles for 100 * NoCforMe
47836 cycles for 100 * CRT sprintf
680 cycles for 100 * Bin2Hex
1085 cycles for 100 * Bin2Hex Avx2
287 cycles for 100 * Bin2Hex JJ
493 cycles for 100 * dwtoHex_Guga_SSE
603 cycles for 100 * dwtoHex_Guga
5470 cycles for 100 * dw2hex (Masm32 SDK)
1405 cycles for 100 * NoCforMe
48769 cycles for 100 * CRT sprintf
671 cycles for 100 * Bin2Hex
1074 cycles for 100 * Bin2Hex Avx2
287 cycles for 100 * Bin2Hex JJ
494 cycles for 100 * dwtoHex_Guga_SSE
547 cycles for 100 * dwtoHex_Guga
5461 cycles for 100 * dw2hex (Masm32 SDK)
1428 cycles for 100 * NoCforMe
48496 cycles for 100 * CRT sprintf
669 cycles for 100 * Bin2Hex
1085 cycles for 100 * Bin2Hex Avx2
273 cycles for 100 * Bin2Hex JJ
490 cycles for 100 * dwtoHex_Guga_SSE
547 cycles for 100 * dwtoHex_Guga
5452 cycles for 100 * dw2hex (Masm32 SDK)
1413 cycles for 100 * NoCforMe
47613 cycles for 100 * CRT sprintf
668 cycles for 100 * Bin2Hex
1098 cycles for 100 * Bin2Hex Avx2
273 cycles for 100 * Bin2Hex JJ
490 cycles for 100 * dwtoHex_Guga_SSE
578 cycles for 100 * dwtoHex_Guga
20 bytes for dw2hex (Masm32 SDK)
572 bytes for NoCforMe
29 bytes for CRT sprintf
130 bytes for Bin2Hex
182 bytes for Bin2Hex Avx2
168 bytes for Bin2Hex JJ
245848 bytes for dwtoHex_Guga_SSE
76 bytes for dwtoHex_Guga
12345678 = eax dw2hex (Masm32 SDK)
12345678 = eax NoCforMe
12345678 = eax CRT sprintf
12345678 = eax Bin2Hex
12345678 = eax Bin2Hex Avx2
12345678 = eax Bin2Hex JJ
12345678 = eax dwtoHex_Guga_SSE
12345678 = eax dwtoHex_Guga
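The "over 100x faster than the CRT (>170x on AMD)" estimate can be checked against the two result sets above by dividing the CRT sprintf cycle count by the Bin2Hex JJ count. A small sketch in C, where `speedup` is a hypothetical helper name and the inputs are copied from the timings above:

```c
#include <assert.h>

/* Ratio of two cycle counts: how many times faster the candidate is
   than the baseline. */
double speedup(double baseline_cycles, double candidate_cycles)
{
    assert(candidate_cycles > 0.0);
    return baseline_cycles / candidate_cycles;
}
```

With the first-run numbers, `speedup(52270, 443)` comes to roughly 118 on the Intel i5-2450M and `speedup(47836, 287)` to roughly 167 on the AMD Athlon Gold 3150U. The last Athlon run, `speedup(47613, 273)`, gives about 174.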
AMD Ryzen 9 5950X 16-Core Processor (SSE4)
4696 cycles for 100 * dw2hex (Masm32 SDK)
1379 cycles for 100 * NoCforMe
26732 cycles for 100 * CRT sprintf
362 cycles for 100 * Bin2Hex
18474 cycles for 100 * Bin2Hex Avx2
587 cycles for 100 * Bin2Hex JJ
428 cycles for 100 * dwtoHex_Guga_SSE
439 cycles for 100 * dwtoHex_Guga
4699 cycles for 100 * dw2hex (Masm32 SDK)
1382 cycles for 100 * NoCforMe
26661 cycles for 100 * CRT sprintf
360 cycles for 100 * Bin2Hex
18560 cycles for 100 * Bin2Hex Avx2
284 cycles for 100 * Bin2Hex JJ
428 cycles for 100 * dwtoHex_Guga_SSE
440 cycles for 100 * dwtoHex_Guga
4708 cycles for 100 * dw2hex (Masm32 SDK)
1380 cycles for 100 * NoCforMe
26639 cycles for 100 * CRT sprintf
360 cycles for 100 * Bin2Hex
18588 cycles for 100 * Bin2Hex Avx2
284 cycles for 100 * Bin2Hex JJ
429 cycles for 100 * dwtoHex_Guga_SSE
439 cycles for 100 * dwtoHex_Guga
4699 cycles for 100 * dw2hex (Masm32 SDK)
1378 cycles for 100 * NoCforMe
26673 cycles for 100 * CRT sprintf
366 cycles for 100 * Bin2Hex
18880 cycles for 100 * Bin2Hex Avx2
289 cycles for 100 * Bin2Hex JJ
430 cycles for 100 * dwtoHex_Guga_SSE
440 cycles for 100 * dwtoHex_Guga
20 bytes for dw2hex (Masm32 SDK)
572 bytes for NoCforMe
29 bytes for CRT sprintf
130 bytes for Bin2Hex
182 bytes for Bin2Hex Avx2
168 bytes for Bin2Hex JJ
245848 bytes for dwtoHex_Guga_SSE
76 bytes for dwtoHex_Guga
12345678 = eax dw2hex (Masm32 SDK)
12345678 = eax NoCforMe
12345678 = eax CRT sprintf
12345678 = eax Bin2Hex
12345678 = eax Bin2Hex Avx2
12345678 = eax Bin2Hex JJ
12345678 = eax dwtoHex_Guga_SSE
12345678 = eax dwtoHex_Guga