ASM for FUN NEW step #1

Started by frktons, November 09, 2021, 09:04:58 AM

jj2007

Quote from: hutch-- on November 10, 2021, 06:32:03 AM
Simple, make a dedicated integer conversion that dealt exactly with 4 characters then use it as an index for the array. What is in the array ? Anything you like.

This "dedicated integer conversion" looks suspiciously like the method I used in reply #8, except that no array is needed once you have the index...

Let's do some timings, just for fun?  :biggrin:

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

469     cycles for 100 * C2D_J
1594    cycles for 100 * atodw
21894   cycles for 100 * sscanf

485     cycles for 100 * C2D_J
1582    cycles for 100 * atodw
22022   cycles for 100 * sscanf

469     cycles for 100 * C2D_J
1590    cycles for 100 * atodw
21863   cycles for 100 * sscanf

476     cycles for 100 * C2D_J
1573    cycles for 100 * atodw
21894   cycles for 100 * sscanf

50      bytes for C2D_J
10      bytes for atodw
22      bytes for sscanf

1234    = eax C2D_J
1234    = eax atodw
1234    = eax sscanf

mineiro

Quote from: jj2007 on November 10, 2021, 07:59:05 AM
This "dedicated integer conversion" looks suspiciously like the method I used in reply #8. Only that no array is needed after you got the index...
I was thinking something like this:

lea esi,lookup_table
mov edx,0f0f0f0fh       ;mask: low 4 bits of each byte

mov ecx,"1234"          ;source
pext eax,ecx,edx        ;index=eax=00001234h; 64K-entry lookup table, 2 bytes per element; requires BMI2 (check cpuid)
movzx ebx,word ptr [esi+eax*2]      ;ebx=converted value
I'd rather be this ambulant metamorphosis than to have that old opinion about everything

frktons

I tried a solution similar to the one Steve suggested, but with a different logic and layout.
I used four arrays of 64 elements each, indexed by the ASCII code of the digit (for "1" that is 49).
In the Thousands array I initialized element 49 (the ASCII code for "1") with the value 1000, and did the same
for the other digits up to "9".
That array is for the thousands; the second array is for the hundreds, then come the tens and the units.

The conversion from "1234" to its corresponding number is just the sum Thousands(ASCII code of first digit) +
Hundreds(ASCII code of second digit) + Tens(ASCII code of third digit) + Units(ASCII code of fourth digit).

I tried it in PowerBasic, and it is 30% faster than the standard VAL() function.

I suppose in ASM it can be much faster, because we can use registers, shifts, and ADD to get the
result from arrays holding the values we need: 1-9, 10-90, 100-900, 1000-9000.

Thanks all for your suggestions and inspiration. The MASM Forum is a great place to be.

:thumbsup:
There are only two days a year when you can't do anything: one is called yesterday, the other is called tomorrow, so today is the right day to love, believe, do and, above all, live.

Dalai Lama

frktons

Quote from: jj2007 on November 09, 2021, 10:13:07 PM
Quote from: frktons on November 09, 2021, 02:24:00 PM
EAX contains the thousands (4) so its value has to be multiplied by 1000
EBX contains the hundreds (7) so its value has to be multiplied by 100
ECX contains the tens (3) so its value has to be multiplied by 10
EDX contains the units

include \masm32\include\masm32rt.inc

.data
x123 db "1234", 0

.code
start:
  mov ecx, dword ptr x123
  movzx eax, cl ; "1"
  and eax, 15 ; 1
  imul eax, eax, 1000 ; 1000
  movzx edx, ch ; "2"
  and edx, 15 ; 2
  imul edx, edx, 100 ; 200
  add eax, edx ; 1200
  bswap ecx
  movzx edx, ch ; "3"
  and edx, 15 ; 3
  imul edx, edx, 10 ; 30
  add eax, edx ; 1230
  and ecx, 15 ; "4" -> 4
  add eax, ecx ; 1234
  MsgBox 0, cat$(str$(eax), " is the value"), "That was simple:", MB_OK
  exit
end start


Hi JJ.

The solution you suggested is quite simple and elegant.
Pure 32 bit ASM code.
I suppose that, for the digit characters "0" to "9",
AND EAX, 15 is the equivalent of SUB EAX, 48 (maybe faster?)

If we had arrays holding all the round thousands, hundreds and tens,
used the ASCII codes directly as indexes into these arrays,
and at the end just added the four elements corresponding to
the four ASCII codes, could we get a faster result?

I think IMUL is slower than ADD, and we would not need the AND REG, 15.

What do you think, my friend?

Enjoy

jj2007

Hi mineiro & Frank,

Attached a version that you can use for your algos. TestD and TestE are free - just insert your code as demonstrated below the TestA_s: label.

No MasmBasic required, just the plain Masm32 SDK. Enjoy :thumbsup:

mineiro

Thanks, sir jj2007.
Results follow. Please note that the reported code size of my procedure is wrong; it is bigger than it appears:

~/.wine/drive_c/FourCharsToDword2v1$ wine FourCharsToDword.exe
Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz (SSE4)

449 cycles for 100 * C2D_J
1499 cycles for 100 * atodw
12673 cycles for 100 * sscanf
342 cycles for 100 * mineiro
?? cycles for 100 * testE

442 cycles for 100 * C2D_J
1523 cycles for 100 * atodw
12670 cycles for 100 * sscanf
342 cycles for 100 * mineiro
?? cycles for 100 * testE

435 cycles for 100 * C2D_J
1514 cycles for 100 * atodw
12673 cycles for 100 * sscanf
344 cycles for 100 * mineiro
?? cycles for 100 * testE

435 cycles for 100 * C2D_J
1511 cycles for 100 * atodw
12891 cycles for 100 * sscanf
329 cycles for 100 * mineiro
?? cycles for 100 * testE

50 bytes for C2D_J
10 bytes for atodw
22 bytes for sscanf
46 bytes for mineiro
2 bytes for testE

1234 = eax C2D_J
1234 = eax atodw
1234 = eax sscanf
1234 = eax mineiro
2 = eax testE


frktons

Hi JJ, Mineiro.  :eusa_clap:

As expected, with SSE/SSE2 and more advanced opcodes we can get great results
when we work with multiple registers in parallel.

It will take me some time to translate my ideas into ASM, but I presume the results could
be interesting.

Ad maiora.

:thumbsup:

jj2007

Quote from: mineiro on November 10, 2021, 01:23:05 PM
Results follow. Please note that the reported code size of my procedure is wrong; it is bigger than it appears:

Very nice! Code size is calculated including the loop & call part but not including data, such as the 8 bytes of your total variable. On my old i5 your code is a tick slower than mine, but I see it's much faster on your i7 :thumbsup:

This should actually go into the proc, not before the loop, in order to be comparable to my algo; it does not influence speed, though:
movq xmm1,qword ptr [total] ;static values
pxor xmm2,xmm2


Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

495     cycles for 100 * C2D_J
1640    cycles for 100 * atodw
21981   cycles for 100 * sscanf
544     cycles for 100 * mineiro

469     cycles for 100 * C2D_J
1578    cycles for 100 * atodw
21924   cycles for 100 * sscanf
550     cycles for 100 * mineiro

471     cycles for 100 * C2D_J
1579    cycles for 100 * atodw
21940   cycles for 100 * sscanf
545     cycles for 100 * mineiro

469     cycles for 100 * C2D_J
1577    cycles for 100 * atodw
22206   cycles for 100 * sscanf
545     cycles for 100 * mineiro

50      bytes for C2D_J
10      bytes for atodw
22      bytes for sscanf
54      bytes for mineiro

1234    = eax C2D_J
1234    = eax atodw
1234    = eax sscanf
1234    = eax mineiro


P.S.: There is a useE=1 at the beginning of the file. Set useE=0 to suppress the unused TestE loop.

mineiro

Quote from: jj2007 on November 10, 2021, 07:23:10 PM
This should actually go into the proc, not before the loop, in order to be comparable to my algo; it does not influence speed, though:
movq xmm1,qword ptr [total] ;static values
pxor xmm2,xmm2

With that code moved inside the procedure I got a value very close to the one from your code posted before.
I have tried to favor parallelization and passed 2 values to be converted in a single call (eax and edx registers); it seems OK.

424 cycles for 100 * C2D_J
1504 cycles for 100 * atodw
12760 cycles for 100 * sscanf
652 cycles for 100 * mineiro   ;;converting 2 values simultaneously, all code inside procedure


I remember that in the past, when we competed on code, we avoided mul/imul in favor of lea. I tried that "first approach" posted in this topic, but performance was not good on my PC (always preserving register usage (ebx)): something like 700 ~ 720 cycles.

So, let's wait for the next round, sir jj2007.

jj2007

Quote from: mineiro on November 10, 2021, 09:04:43 PM
I have tried to favor parallelization and passed 2 values to be converted in a single call (eax and edx registers); it seems OK.

I'm not quite sure what your algo does. The other algos get a string pointer as argument:

push eax
invoke crt_sscanf, chr$("1234"), chr$("%d"), esp
pop eax

mineiro

Quote from: jj2007 on November 10, 2021, 10:40:38 PM
I'm not quite sure what your algo does. The other algos get a string pointer as argument:
Well, then we need to define some rules.
Your algo does the same as mine; only a reversed string is used:

mov eax, "4321"
call C2D_J


Here follow the algo that can perform 2 conversions (min2cvt) and the one that uses lea.
I will edit this message and post results from another computer:

Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz (SSE4)

553 cycles for 100 * C2D_J
1487 cycles for 100 * atodw
12890 cycles for 100 * sscanf
739 cycles for 100 * min2cvt
333 cycles for 100 * min1cvt
895 cycles for 100 * min_lea

562 cycles for 100 * C2D_J
1485 cycles for 100 * atodw
12871 cycles for 100 * sscanf
743 cycles for 100 * min2cvt
333 cycles for 100 * min1cvt
895 cycles for 100 * min_lea

566 cycles for 100 * C2D_J
1486 cycles for 100 * atodw
12900 cycles for 100 * sscanf
746 cycles for 100 * min2cvt
332 cycles for 100 * min1cvt
751 cycles for 100 * min_lea

561 cycles for 100 * C2D_J
1559 cycles for 100 * atodw
12603 cycles for 100 * sscanf
742 cycles for 100 * min2cvt
332 cycles for 100 * min1cvt
895 cycles for 100 * min_lea

50 bytes for C2D_J
10 bytes for atodw
22 bytes for sscanf
77 bytes for min2cvt
44 bytes for min1cvt
92 bytes for min_lea

1234 = eax C2D_J
1234 = eax atodw
1234 = eax sscanf
1234 = eax min2cvt
1234 = eax min1cvt
1234 = eax min_lea




wine FourCharsToDword.exe
Intel(R) Core(TM) i3-3110M CPU @ 2.40GHz (SSE4)

561 cycles for 100 * C2D_J
1912 cycles for 100 * atodw
19520 cycles for 100 * sscanf
672 cycles for 100 * min2cvt
458 cycles for 100 * min1cvt
765 cycles for 100 * min_lea

568 cycles for 100 * C2D_J
1919 cycles for 100 * atodw
19520 cycles for 100 * sscanf
670 cycles for 100 * min2cvt
457 cycles for 100 * min1cvt
767 cycles for 100 * min_lea

561 cycles for 100 * C2D_J
1918 cycles for 100 * atodw
19488 cycles for 100 * sscanf
672 cycles for 100 * min2cvt
456 cycles for 100 * min1cvt
767 cycles for 100 * min_lea

571 cycles for 100 * C2D_J
1912 cycles for 100 * atodw
19498 cycles for 100 * sscanf
671 cycles for 100 * min2cvt
458 cycles for 100 * min1cvt
769 cycles for 100 * min_lea

50 bytes for C2D_J
10 bytes for atodw
22 bytes for sscanf
77 bytes for min2cvt
44 bytes for min1cvt
92 bytes for min_lea

1234 = eax C2D_J
1234 = eax atodw
1234 = eax sscanf
1234 = eax min2cvt
1234 = eax min1cvt
1234 = eax min_lea

mineiro

I changed my previous code to take a string pointer as argument.
The previous cycle counts did not change.
I removed the lea algo.

jj2007

Quote from: mineiro on November 11, 2021, 12:32:30 AM
I changed my previous code to take a string pointer as argument

I'm afraid you are not passing a string pointer:
mov eax, "4321" ;first value to be converted
mov edx, "1234" ;second value to be converted


mov eax, chr$("1234") would pass a string pointer, as in
  .Repeat
push eax
invoke crt_sscanf, chr$("1234"), chr$("%d"), esp
pop eax
dec ebx
  .Until Sign?

mineiro

Quote from: jj2007 on November 11, 2021, 01:34:32 AM
I'm afraid you are not passing a string pointer:
mov eax, "4321" ;first value to be converted
mov edx, "1234" ;second value to be converted

mov eax, chr$("1234") would pass a string pointer, as in

Both my procedure and yours receive the data loaded directly into the eax register, while the other functions (atodw/crt_sscanf) receive a pointer to the data on the stack.
RULES:
All procedures should receive a pointer to the data on the stack, stdcall.
I changed C2D_J, min2cvt and min1cvt.


wine FourCharsToDword.exe
Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz (SSE4)

557 cycles for 100 * C2D_J
1524 cycles for 100 * atodw
12665 cycles for 100 * sscanf
767 cycles for 100 * min2cvt
429 cycles for 100 * min1cvt

579 cycles for 100 * C2D_J
1499 cycles for 100 * atodw
12649 cycles for 100 * sscanf
785 cycles for 100 * min2cvt
429 cycles for 100 * min1cvt

602 cycles for 100 * C2D_J
1496 cycles for 100 * atodw
12631 cycles for 100 * sscanf
783 cycles for 100 * min2cvt
429 cycles for 100 * min1cvt

591 cycles for 100 * C2D_J
1497 cycles for 100 * atodw
12631 cycles for 100 * sscanf
747 cycles for 100 * min2cvt
431 cycles for 100 * min1cvt

54 bytes for C2D_J
10 bytes for atodw
22 bytes for sscanf
84 bytes for min2cvt
52 bytes for min1cvt

1234 = eax C2D_J
1234 = eax atodw
1234 = eax sscanf
1234 = eax min2cvt
1234 = eax min1cvt

jj2007

Excellent, mineiro :thumbsup:

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

467     cycles for 100 * C2D_J
1592    cycles for 100 * atodw
21900   cycles for 100 * sscanf
761     cycles for 100 * min2cvt
543     cycles for 100 * min1cvt

467     cycles for 100 * C2D_J
1588    cycles for 100 * atodw
21908   cycles for 100 * sscanf
763     cycles for 100 * min2cvt
543     cycles for 100 * min1cvt

467     cycles for 100 * C2D_J
1600    cycles for 100 * atodw
21898   cycles for 100 * sscanf
768     cycles for 100 * min2cvt
549     cycles for 100 * min1cvt

468     cycles for 100 * C2D_J
1591    cycles for 100 * atodw
21924   cycles for 100 * sscanf
767     cycles for 100 * min2cvt
545     cycles for 100 * min1cvt

50      bytes for C2D_J
10      bytes for atodw
22      bytes for sscanf
98      bytes for min2cvt
62      bytes for min1cvt

1234    = eax C2D_J
1234    = eax atodw
1234    = eax sscanf
1234    = eax min2cvt
1234    = eax min1cvt