ASM for FUN NEW step #1

jj2007 · November 10, 2021, 07:59:05 AM

Quote from: hutch-- on November 10, 2021, 06:32:03 AM
Simple, make a dedicated integer conversion that dealt exactly with 4 characters then use it as an index for the array. What is in the array ? Anything you like.

This "dedicated integer conversion" looks suspiciously like the method I used in reply #8. Only that no array is needed after you got the index...

Let's do some timings, just for fun?

Code Select

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

469     cycles for 100 * C2D_J
1594    cycles for 100 * atodw
21894   cycles for 100 * sscanf

485     cycles for 100 * C2D_J
1582    cycles for 100 * atodw
22022   cycles for 100 * sscanf

469     cycles for 100 * C2D_J
1590    cycles for 100 * atodw
21863   cycles for 100 * sscanf

476     cycles for 100 * C2D_J
1573    cycles for 100 * atodw
21894   cycles for 100 * sscanf

50      bytes for C2D_J
10      bytes for atodw
22      bytes for sscanf

1234    = eax C2D_J
1234    = eax atodw
1234    = eax sscanf

mineiro · November 10, 2021, 09:06:41 AM

Quote from: jj2007 on November 10, 2021, 07:59:05 AM
This "dedicated integer conversion" looks suspiciously like the method I used in reply #8. Only that no array is needed after you got the index...

I was thinking something like this:

Code Select


lea esi,lookup_table
mov edx,0f0f0f0fh       ;mask, lower bits of each byte

mov ecx,"1234"          ;source
pext eax,ecx,edx        ;index=eax=00001234h, 64k lookup table *2 (element sizeof)    ;BMI2 cpuid
movzx ebx,word ptr [esi+eax*2]      ;ebx=converted value
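
The idea above can be sketched in portable C. This is only an illustration under stated assumptions: `pext32` is a hypothetical software stand-in for the BMI2 `pext` instruction, and `init_lookup_table` and `convert_packed` are names invented here, not code from the thread. The packed input holds the first digit in the top byte, matching the `00001234h` index shown in the comment above.

```c
#include <assert.h>
#include <stdint.h>

/* 64K-entry table: the index is four packed nibbles (0x1234), the value is 1234. */
static uint16_t lookup_table[65536];

/* Fill only the entries whose four nibbles are all decimal digits. */
static void init_lookup_table(void) {
    for (unsigned d3 = 0; d3 < 10; d3++)
        for (unsigned d2 = 0; d2 < 10; d2++)
            for (unsigned d1 = 0; d1 < 10; d1++)
                for (unsigned d0 = 0; d0 < 10; d0++)
                    lookup_table[(d3 << 12) | (d2 << 8) | (d1 << 4) | d0] =
                        (uint16_t)(d3 * 1000 + d2 * 100 + d1 * 10 + d0);
}

/* Hypothetical portable stand-in for pext: gather the bits of src
   selected by mask into the low bits of the result. */
static uint32_t pext32(uint32_t src, uint32_t mask) {
    uint32_t out = 0, out_bit = 1;
    while (mask != 0) {
        uint32_t lowest = mask & (~mask + 1u);  /* lowest set bit of mask */
        if (src & lowest)
            out |= out_bit;
        out_bit <<= 1;
        mask &= mask - 1u;                      /* clear that bit */
    }
    return out;
}

/* packed: four ASCII digits, first digit in the top byte (0x31323334 for "1234"). */
static unsigned convert_packed(uint32_t packed) {
    uint32_t index = pext32(packed, 0x0f0f0f0fu);  /* keep the low nibble of each byte */
    assert(index < 65536u);
    return lookup_table[index];
}
```

After one call to `init_lookup_table()`, `convert_packed(0x31323334u)` looks up index `0x1234`. The table costs 128 KB, so whether the lookup beats a few multiplications likely depends on cache behavior.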

frktons · November 10, 2021, 09:22:36 AM

I tried a solution similar to the one Steve suggested, but with different logic and layout.
I used four arrays of 64 elements each, indexed directly by ASCII code (49 for "1").
In the Thousands array I initialized element 49 (the ASCII code for "1") with the value 1000, and did the same
for the other digits up to "9".
That array was for the thousands; the second array was for the hundreds, then came the tens and the units.

Converting "1234" to its number is then just the sum Thousands(ASCII code of first digit) +
Hundreds(ASCII code of second digit) + Tens(ASCII code of third digit) + Units(ASCII code of fourth digit).
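
A minimal C sketch of this four-table scheme, assuming 64-element tables indexed by ASCII code as described. The names `init_digit_tables` and `four_digits_to_number` are made up for illustration; this is not the PowerBasic code mentioned in the post.

```c
#include <assert.h>
#include <stddef.h>

/* One table per decimal position, indexed directly by ASCII code (48..57). */
static unsigned short thousands[64], hundreds[64], tens[64], units[64];

/* Store the weighted value of each digit at the index of its ASCII code. */
static void init_digit_tables(void) {
    for (int d = 0; d <= 9; d++) {
        int code = '0' + d;
        thousands[code] = (unsigned short)(d * 1000);
        hundreds[code]  = (unsigned short)(d * 100);
        tens[code]      = (unsigned short)(d * 10);
        units[code]     = (unsigned short)d;
    }
}

/* Convert exactly four ASCII digits with four lookups and three additions. */
static unsigned four_digits_to_number(const char *s) {
    assert(s != NULL);
    for (int i = 0; i < 4; i++)
        assert(s[i] >= '0' && s[i] <= '9');  /* tables only cover digit codes */
    return (unsigned)thousands[(unsigned char)s[0]]
         + hundreds[(unsigned char)s[1]]
         + tens[(unsigned char)s[2]]
         + units[(unsigned char)s[3]];
}
```

No masking or multiplication happens at conversion time; the cost moves into four memory reads.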

I tried it in PowerBasic, and it is 30% faster than the standard VAL() function.

I suppose in ASM it can be much faster because we can use registers, shifts, and ADD to get the
result with arrays with the values we need: 1-9, 10-90,100-900, 1000-9000.

Thanks all for your suggestions and inspiration. The MASM Forum is a great place to be.

frktons · November 10, 2021, 10:26:30 AM

Quote from: jj2007 on November 09, 2021, 10:13:07 PM
Quote from: frktons on November 09, 2021, 02:24:00 PM
EAX contains the thousands (4) so its value has to be multiplied by 1000
EBX contains the hundreds (7) so its value has to be multiplied by 100
ECX contains the tens (3) so its value has to be multiplied by 10
EDX contains the units

Code Select

include \masm32\include\masm32rt.inc

.data
x123 db "1234", 0

.code
start:
  mov ecx, dword ptr x123
  movzx eax, cl         ; "1"
  and eax, 15           ; 1
  imul eax, eax, 1000   ; 1000
  movzx edx, ch         ; "2"
  and edx, 15           ; 2
  imul edx, edx, 100    ; 200
  add eax, edx          ; 1200
  bswap ecx
  movzx edx, ch         ; "3"
  and edx, 15           ; 3
  imul edx, edx, 10     ; 30
  add eax, edx          ; 1230
  and ecx, 15           ; "4" -> 4
  add eax, ecx          ; 1234
  MsgBox 0, cat$(str$(eax), " is the value"), "That was simple:", MB_OK
  exit

end start
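
For readers who prefer C, the same mask-and-multiply arithmetic can be sketched as follows. The function name `c2d_sketch` is hypothetical, and a compiler is free to generate different instructions than the assembly above.

```c
#include <assert.h>
#include <stddef.h>

/* Mask each ASCII digit down to its value, scale by its decimal weight, and sum. */
static unsigned c2d_sketch(const char *s) {
    assert(s != NULL);
    unsigned result = (unsigned)(s[0] & 15) * 1000;  /* thousands */
    result += (unsigned)(s[1] & 15) * 100;           /* hundreds  */
    result += (unsigned)(s[2] & 15) * 10;            /* tens      */
    result += (unsigned)(s[3] & 15);                 /* units     */
    return result;
}
```

Like the assembly, this assumes exactly four digit characters and does no validation.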

Hi JJ.

The solution you suggested is quite simple and elegant.
Pure 32 bit ASM code.
I suppose that the code:
AND EAX, 15 is the equivalent of SUB EAX, 48 (maybe faster?)
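
A quick C check of that equivalence, as a sketch with invented helper names. It only compares results and says nothing about speed.

```c
#include <assert.h>

/* Digit value by masking the low nibble (AND reg, 15). */
static int digit_by_and(int c) {
    assert(c >= 0 && c <= 255);
    return c & 15;
}

/* Digit value by subtracting the ASCII code of '0' (SUB reg, 48). */
static int digit_by_sub(int c) {
    assert(c >= 0 && c <= 255);
    return c - 48;
}

/* Returns 1 if both methods agree for every character '0'..'9'. */
static int methods_agree_on_digits(void) {
    for (int c = '0'; c <= '9'; c++)
        if (digit_by_and(c) != digit_by_sub(c))
            return 0;
    return 1;
}
```

The two agree for "0" to "9" because those codes are 30h to 39h. They differ for other characters: for "A" (65) the mask gives 1 while the subtraction gives 17.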

If we have an array with all the rounded thousands, hundreds, tens
and we use directly the ASCII code as an index to these arrays
and at the end we just add the four elements corresponding to
the four ASCII codes, could we have a faster result?

I think IMUL is slower than ADD, and we don't need the AND REG, 15.

What do you think my friend?

Enjoy

jj2007 · November 10, 2021, 12:59:38 PM

Hi mineiro & Frank,

Attached a version that you can use for your algos. TestD and TestE are free - just insert your code as demonstrated below the TestA_s: label.

No MasmBasic required, just the plain Masm32 SDK. Enjoy

mineiro · November 10, 2021, 01:23:05 PM

Thanks sir jj2007;
Results follow. Please note that the reported code size of my procedure is wrong; it is bigger than it appears:

Code Select


~/.wine/drive_c/FourCharsToDword2v1$ wine FourCharsToDword.exe 
Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz (SSE4)

449	cycles for 100 * C2D_J
1499	cycles for 100 * atodw
12673	cycles for 100 * sscanf
342	cycles for 100 * mineiro
??	cycles for 100 * testE

442	cycles for 100 * C2D_J
1523	cycles for 100 * atodw
12670	cycles for 100 * sscanf
342	cycles for 100 * mineiro
??	cycles for 100 * testE

435	cycles for 100 * C2D_J
1514	cycles for 100 * atodw
12673	cycles for 100 * sscanf
344	cycles for 100 * mineiro
??	cycles for 100 * testE

435	cycles for 100 * C2D_J
1511	cycles for 100 * atodw
12891	cycles for 100 * sscanf
329	cycles for 100 * mineiro
??	cycles for 100 * testE

50	bytes for C2D_J
10	bytes for atodw
22	bytes for sscanf
46	bytes for mineiro
2	bytes for testE

1234	= eax C2D_J
1234	= eax atodw
1234	= eax sscanf
1234	= eax mineiro
2	= eax testE

frktons · November 10, 2021, 04:44:59 PM

Hi JJ, Mineiro.

As supposed, with SSE/SSE2 and more advanced opcodes we can have great results
when we work with multiple registers in parallel operations.

It will take me some time to translate my ideas into ASM, but I presume the results could
be interesting.

Ad maiora.

jj2007 · November 10, 2021, 07:23:10 PM

Quote from: mineiro on November 10, 2021, 01:23:05 PM
Results follow. Please note that the reported code size of my procedure is wrong; it is bigger than it appears:

Very nice! Code size is calculated including the loop & call part but not including data, such as the 8 bytes of your total variable. On my old i5 your code is a tick slower than mine, but I see it's much faster on your i7

This should actually go into the proc, not before the loop, in order to be comparable to my algo; it does not influence speed, though:

Code Select

movq xmm1,qword ptr [total]	;static values
pxor xmm2,xmm2

Code Select

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

495     cycles for 100 * C2D_J
1640    cycles for 100 * atodw
21981   cycles for 100 * sscanf
544     cycles for 100 * mineiro

469     cycles for 100 * C2D_J
1578    cycles for 100 * atodw
21924   cycles for 100 * sscanf
550     cycles for 100 * mineiro

471     cycles for 100 * C2D_J
1579    cycles for 100 * atodw
21940   cycles for 100 * sscanf
545     cycles for 100 * mineiro

469     cycles for 100 * C2D_J
1577    cycles for 100 * atodw
22206   cycles for 100 * sscanf
545     cycles for 100 * mineiro

50      bytes for C2D_J
10      bytes for atodw
22      bytes for sscanf
54      bytes for mineiro

1234    = eax C2D_J
1234    = eax atodw
1234    = eax sscanf
1234    = eax mineiro

P.S.: There is a useE=1 at the beginning of the file. Set useE=0 to suppress the unused TestE loop.

mineiro · November 10, 2021, 09:04:43 PM

Quote from: jj2007 on November 10, 2021, 07:23:10 PM
This should actually go into the proc, not before the loop, in order to be comparable to my algo; it does not influence speed, though:
Code Select
movq xmm1,qword ptr [total]	;static values
pxor xmm2,xmm2

After moving that code inside the procedure I got a value very close to that of your code posted before.
I have tried to favor parallelization and passed 2 values to be converted in a single call (eax and edx registers); it seems ok.

Code Select


424	cycles for 100 * C2D_J
1504	cycles for 100 * atodw
12760	cycles for 100 * sscanf
652	cycles for 100 * mineiro   ;;converting 2 values simultaneously, all code inside procedure

I remember that in the past, when we competed with code, we avoided mul/imul in favor of lea. I have tried that on the "first approach" posted in this topic, but performance was not good on my PC (while still preserving ebx): something like 700 ~ 720 cycles.
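
To illustrate the lea trick in general terms: x86 `lea` can compute `r + r*4` and `d + r*2`, so two of them give `r*10 + d` without a multiply. The C sketch below mirrors only that arithmetic. The function names are hypothetical, and this is not the min_lea procedure timed in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* acc*10 + digit using only the add/scale forms that lea offers. */
static unsigned times10_plus(unsigned acc, unsigned digit) {
    unsigned acc5 = acc + acc * 4;  /* like lea r, [r + r*4]  -> acc*5       */
    return digit + acc5 * 2;        /* like lea r, [d + r*2]  -> acc*10 + d  */
}

/* Horner-style conversion of exactly four ASCII digits. */
static unsigned convert_lea_style(const char *s) {
    assert(s != NULL);
    unsigned acc = (unsigned)(s[0] & 15);
    acc = times10_plus(acc, (unsigned)(s[1] & 15));
    acc = times10_plus(acc, (unsigned)(s[2] & 15));
    acc = times10_plus(acc, (unsigned)(s[3] & 15));
    return acc;
}
```

Each step depends on the previous one, so this form is a serial chain; that may be one reason a lea version does not automatically win over independent multiplications.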

So, let's wait for the next round, sir jj2007.

jj2007 · November 10, 2021, 10:40:38 PM

Quote from: mineiro on November 10, 2021, 09:04:43 PM
I have tried to favor parallelization and passed 2 values to be converted in a single call (eax and edx registers); it seems ok.

I'm not quite sure what your algo does. The other algos get a string pointer as argument:

Code Select

	push eax
	invoke crt_sscanf, chr$("1234"), chr$("%d"), esp
	pop eax
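
The same call in C, as a small sketch: the string pointer and a pointer to the result are passed to `sscanf`, and the return value reports how many fields were converted. The wrapper name `parse_with_sscanf` is invented for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Parse a decimal integer from s into *out; returns 1 on success, 0 otherwise. */
static int parse_with_sscanf(const char *s, int *out) {
    assert(s != NULL && out != NULL);
    return sscanf(s, "%d", out) == 1;  /* sscanf returns the number of converted fields */
}
```

The format string has to be interpreted on every call, which is consistent with sscanf being the slowest entry in the timings.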

mineiro · November 10, 2021, 11:17:07 PM

Quote from: jj2007 on November 10, 2021, 10:40:38 PM
I'm not quite sure what your algo does. The other algos get a string pointer as argument:

Well, then we need to define some rules.
Your algo does the same as mine, only with the string reversed.

Code Select


	mov eax, "4321"
	call C2D_J
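
The "reversed string" point can be checked with a little C. This sketch assumes, in line with the `00001234h` comment earlier in the thread, that a MASM four-character literal puts its first character in the most significant byte; `load_dword_le` and `masm_literal` are hypothetical helpers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Value a little-endian x86 dword load from the string s would produce. */
static uint32_t load_dword_le(const char *s) {
    assert(s != NULL);
    return (uint32_t)(unsigned char)s[0]
         | (uint32_t)(unsigned char)s[1] << 8
         | (uint32_t)(unsigned char)s[2] << 16
         | (uint32_t)(unsigned char)s[3] << 24;
}

/* Assumed value of a four-character literal: first character in the top byte. */
static uint32_t masm_literal(const char *s) {
    assert(s != NULL);
    return (uint32_t)(unsigned char)s[0] << 24
         | (uint32_t)(unsigned char)s[1] << 16
         | (uint32_t)(unsigned char)s[2] << 8
         | (uint32_t)(unsigned char)s[3];
}
```

Under that assumption, loading a dword from the string "1234" in memory yields the same register value as the literal "4321", namely 34333231h.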

Below are results for the algo that can perform 2 conversions (min2cvt) and the one that uses lea (min_lea).
I will edit this message and post results from another computer:

Code Select


Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz (SSE4)

553	cycles for 100 * C2D_J
1487	cycles for 100 * atodw
12890	cycles for 100 * sscanf
739	cycles for 100 * min2cvt
333	cycles for 100 * min1cvt
895	cycles for 100 * min_lea

562	cycles for 100 * C2D_J
1485	cycles for 100 * atodw
12871	cycles for 100 * sscanf
743	cycles for 100 * min2cvt
333	cycles for 100 * min1cvt
895	cycles for 100 * min_lea

566	cycles for 100 * C2D_J
1486	cycles for 100 * atodw
12900	cycles for 100 * sscanf
746	cycles for 100 * min2cvt
332	cycles for 100 * min1cvt
751	cycles for 100 * min_lea

561	cycles for 100 * C2D_J
1559	cycles for 100 * atodw
12603	cycles for 100 * sscanf
742	cycles for 100 * min2cvt
332	cycles for 100 * min1cvt
895	cycles for 100 * min_lea

50	bytes for C2D_J
10	bytes for atodw
22	bytes for sscanf
77	bytes for min2cvt
44	bytes for min1cvt
92	bytes for min_lea

1234	= eax C2D_J
1234	= eax atodw
1234	= eax sscanf
1234	= eax min2cvt
1234	= eax min1cvt
1234	= eax min_lea

Code Select


wine FourCharsToDword.exe 
Intel(R) Core(TM) i3-3110M CPU @ 2.40GHz (SSE4)

561	cycles for 100 * C2D_J
1912	cycles for 100 * atodw
19520	cycles for 100 * sscanf
672	cycles for 100 * min2cvt
458	cycles for 100 * min1cvt
765	cycles for 100 * min_lea

568	cycles for 100 * C2D_J
1919	cycles for 100 * atodw
19520	cycles for 100 * sscanf
670	cycles for 100 * min2cvt
457	cycles for 100 * min1cvt
767	cycles for 100 * min_lea

561	cycles for 100 * C2D_J
1918	cycles for 100 * atodw
19488	cycles for 100 * sscanf
672	cycles for 100 * min2cvt
456	cycles for 100 * min1cvt
767	cycles for 100 * min_lea

571	cycles for 100 * C2D_J
1912	cycles for 100 * atodw
19498	cycles for 100 * sscanf
671	cycles for 100 * min2cvt
458	cycles for 100 * min1cvt
769	cycles for 100 * min_lea

50	bytes for C2D_J
10	bytes for atodw
22	bytes for sscanf
77	bytes for min2cvt
44	bytes for min1cvt
92	bytes for min_lea

1234	= eax C2D_J
1234	= eax atodw
1234	= eax sscanf
1234	= eax min2cvt
1234	= eax min1cvt
1234	= eax min_lea

mineiro · November 11, 2021, 12:32:30 AM

I changed my previous code to take a string pointer as argument.
The previous cycle counts did not change.
I removed the lea algo.

jj2007 · November 11, 2021, 01:34:32 AM

Quote from: mineiro on November 11, 2021, 12:32:30 AM
I changed my previous code to take a string pointer as argument

I'm afraid you are not passing a string pointer:

Code Select

		mov eax, "4321"		;first value to be converted
		mov edx, "1234"		;second value to be converted

mov eax, chr$("1234") would pass a string pointer, as in

Code Select

  .Repeat
	push eax
	invoke crt_sscanf, chr$("1234"), chr$("%d"), esp
	pop eax
	dec ebx
  .Until Sign?

mineiro · November 11, 2021, 07:08:29 AM

Quote from: jj2007 on November 11, 2021, 01:34:32 AM
I'm afraid you are not passing a string pointer:
Code Select
mov eax, "4321"		;first value to be converted
mov edx, "1234"		;second value to be converted

mov eax, chr$("1234") would pass a string pointer, as in

Both my procedure and yours receive the data already loaded into the eax register, while the other functions (atodw/crt_sscanf) receive a pointer to the data on the stack.
RULES:
All procedures should receive a pointer to the data on the stack, stdcall.
I changed C2D_J, min2cvt and min1cvt.

Code Select


wine FourCharsToDword.exe 
Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz (SSE4)

557	cycles for 100 * C2D_J
1524	cycles for 100 * atodw
12665	cycles for 100 * sscanf
767	cycles for 100 * min2cvt
429	cycles for 100 * min1cvt

579	cycles for 100 * C2D_J
1499	cycles for 100 * atodw
12649	cycles for 100 * sscanf
785	cycles for 100 * min2cvt
429	cycles for 100 * min1cvt

602	cycles for 100 * C2D_J
1496	cycles for 100 * atodw
12631	cycles for 100 * sscanf
783	cycles for 100 * min2cvt
429	cycles for 100 * min1cvt

591	cycles for 100 * C2D_J
1497	cycles for 100 * atodw
12631	cycles for 100 * sscanf
747	cycles for 100 * min2cvt
431	cycles for 100 * min1cvt

54	bytes for C2D_J
10	bytes for atodw
22	bytes for sscanf
84	bytes for min2cvt
52	bytes for min1cvt

1234	= eax C2D_J
1234	= eax atodw
1234	= eax sscanf
1234	= eax min2cvt
1234	= eax min1cvt

jj2007 · November 11, 2021, 11:38:43 AM

Excellent, mineiro

Code Select

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

467     cycles for 100 * C2D_J
1592    cycles for 100 * atodw
21900   cycles for 100 * sscanf
761     cycles for 100 * min2cvt
543     cycles for 100 * min1cvt

467     cycles for 100 * C2D_J
1588    cycles for 100 * atodw
21908   cycles for 100 * sscanf
763     cycles for 100 * min2cvt
543     cycles for 100 * min1cvt

467     cycles for 100 * C2D_J
1600    cycles for 100 * atodw
21898   cycles for 100 * sscanf
768     cycles for 100 * min2cvt
549     cycles for 100 * min1cvt

468     cycles for 100 * C2D_J
1591    cycles for 100 * atodw
21924   cycles for 100 * sscanf
767     cycles for 100 * min2cvt
545     cycles for 100 * min1cvt

50      bytes for C2D_J
10      bytes for atodw
22      bytes for sscanf
98      bytes for min2cvt
62      bytes for min1cvt

1234    = eax C2D_J
1234    = eax atodw
1234    = eax sscanf
1234    = eax min2cvt
1234    = eax min1cvt

The MASM Forum
