ASM for FUN NEW step #1

Started by frktons, November 09, 2021, 09:04:58 AM


jj2007

I managed to integrate Hutch's finite state machine into the testbed. To assemble it, the num.asm file must be in the same folder. It's still plain Masm32 SDK; MasmBasic is not required. The min1cvt algo is a clear winner :thumbsup:

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

665     cycles for 100 * C2D_J
1600    cycles for 100 * atodw
21983   cycles for 100 * sscanf
766     cycles for 100 * min2cvt
545     cycles for 100 * min1cvt
683     cycles for 100 * ConvertLUT
2292    cycles for 100 * FSM (Hutch)

690     cycles for 100 * C2D_J
1605    cycles for 100 * atodw
21970   cycles for 100 * sscanf
769     cycles for 100 * min2cvt
544     cycles for 100 * min1cvt
653     cycles for 100 * ConvertLUT
2326    cycles for 100 * FSM (Hutch)

660     cycles for 100 * C2D_J
1612    cycles for 100 * atodw
22241   cycles for 100 * sscanf
770     cycles for 100 * min2cvt
547     cycles for 100 * min1cvt
626     cycles for 100 * ConvertLUT
2263    cycles for 100 * FSM (Hutch)

54      bytes for C2D_J
10      bytes for atodw
22      bytes for sscanf
98      bytes for min2cvt
62      bytes for min1cvt
118     bytes for ConvertLUT
8       bytes for FSM (Hutch)

1234    = eax C2D_J
1234    = eax atodw
1234    = eax sscanf
1234    = eax min2cvt
1234    = eax min1cvt
1234    = eax ConvertLUT
1234    = eax FSM (Hutch)

hutch--

The attached zip file contains the source for the FSM. It covers every number between 0000 and 9999 and will test all of them if you make the function call.

The numbers are listed in this form:

0000 0000
0001 0001
0002 0002
0003 0003
0004 0004
............
9995 9995
9996 9996
9997 9997
9998 9998
9999 9999


Pass the first string to the integers procedure and it will return the actual integer in EAX.
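
A minimal sketch in C of the kind of exhaustive check described above: every string from "0000" to "9999" is passed to a converter and the result is compared with the expected integer. The names four_digits_to_int and count_mismatches are made up for illustration, and the converter is a plain loop, not the FSM from the zip file.

```c
#include <stdio.h>

/* Hypothetical stand-in for the routine under test: converts exactly
   four ASCII digits to an integer. This is a plain loop, not the FSM. */
static int four_digits_to_int(const char *s)
{
    int value = 0;
    for (int i = 0; i < 4; i++)
        value = value * 10 + (s[i] - '0');
    return value;
}

/* Feed every string from "0000" to "9999" to the converter and
   return the number of inputs that produced a wrong result. */
static int count_mismatches(int (*convert)(const char *))
{
    char text[5];
    int bad = 0;
    for (int n = 0; n <= 9999; n++) {
        snprintf(text, sizeof text, "%04d", n);
        if (convert(text) != n)
            bad++;
    }
    return bad;
}
```

Calling count_mismatches(four_digits_to_int) should return 0; any other four-digit converter with the same signature can be plugged in the same way.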

TimoVJL

AMD Athlon(tm) II X2 220 Processor (SSE3)

1186    cycles for 100 * C2D_J
2428    cycles for 100 * atodw
35137   cycles for 100 * sscanf
1304    cycles for 100 * min2cvt
1059    cycles for 100 * min1cvt
712     cycles for 100 * ConvertLUT
2780    cycles for 100 * FSM (Hutch)

1189    cycles for 100 * C2D_J
2426    cycles for 100 * atodw
34990   cycles for 100 * sscanf
1305    cycles for 100 * min2cvt
1065    cycles for 100 * min1cvt
721     cycles for 100 * ConvertLUT
2738    cycles for 100 * FSM (Hutch)

1185    cycles for 100 * C2D_J
2427    cycles for 100 * atodw
34735   cycles for 100 * sscanf
1311    cycles for 100 * min2cvt
1058    cycles for 100 * min1cvt
710     cycles for 100 * ConvertLUT
3155    cycles for 100 * FSM (Hutch)

54      bytes for C2D_J
10      bytes for atodw
22      bytes for sscanf
98      bytes for min2cvt
62      bytes for min1cvt
118     bytes for ConvertLUT
8       bytes for FSM (Hutch)

1234    = eax C2D_J
1234    = eax atodw
1234    = eax sscanf
1234    = eax min2cvt
1234    = eax min1cvt
1234    = eax ConvertLUT
1234    = eax FSM (Hutch)

--- ok ---
AMD Ryzen 5 3400G with Radeon Vega Graphics     (SSE4)

532     cycles for 100 * C2D_J
1401    cycles for 100 * atodw
25676   cycles for 100 * sscanf
523     cycles for 100 * min2cvt
435     cycles for 100 * min1cvt
615     cycles for 100 * ConvertLUT
2124    cycles for 100 * FSM (Hutch)

524     cycles for 100 * C2D_J
1371    cycles for 100 * atodw
26048   cycles for 100 * sscanf
524     cycles for 100 * min2cvt
431     cycles for 100 * min1cvt
618     cycles for 100 * ConvertLUT
2129    cycles for 100 * FSM (Hutch)

512     cycles for 100 * C2D_J
1383    cycles for 100 * atodw
25954   cycles for 100 * sscanf
517     cycles for 100 * min2cvt
434     cycles for 100 * min1cvt
615     cycles for 100 * ConvertLUT
2127    cycles for 100 * FSM (Hutch)

54      bytes for C2D_J
10      bytes for atodw
22      bytes for sscanf
98      bytes for min2cvt
62      bytes for min1cvt
118     bytes for ConvertLUT
8       bytes for FSM (Hutch)

1234    = eax C2D_J
1234    = eax atodw
1234    = eax sscanf
1234    = eax min2cvt
1234    = eax min1cvt
1234    = eax ConvertLUT
1234    = eax FSM (Hutch)

-
May the source be with you

TimoVJL

An example of how the MSVC C optimizer works:
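int str4int(char *s)
{
    char anum[5];
    /* mask the four ASCII digits to their values 0..9 in one 32-bit operation */
    (*(int*)&anum) = (*(int*)s) & 0x0f0f0f0f;
    /* a1000, a100 and a10 are the lookup tables for thousands, hundreds and tens */
    return a1000[anum[0]] + a100[anum[1]] + a10[anum[2]] + anum[3];
}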

_str4int:
  [00000000] 8B442404               mov               eax,dword ptr [esp+4]
  [00000004] 53                     push              ebx
  [00000005] 8B18                   mov               ebx,dword ptr [eax]
  [00000007] 81E30F0F0F0F           and               ebx,F0F0F0Fh
  [0000000D] 8BC3                   mov               eax,ebx
  [0000000F] C1E810                 shr               eax,10h
  [00000012] 0FB6D0                 movzx             edx,al
  [00000015] 8BC3                   mov               eax,ebx
  [00000017] C1E808                 shr               eax,8
  [0000001A] 0FB6C8                 movzx             ecx,al
  [0000001D] 8B049500000000         mov               eax,dword ptr [edx*4+_a10]
  [00000024] 03048D00000000         add               eax,dword ptr [ecx*4+_a100]
  [0000002B] 0FB6CB                 movzx             ecx,bl
  [0000002E] C1EB18                 shr               ebx,18h
  [00000031] 03048D00000000         add               eax,dword ptr [ecx*4+_a1000]
  [00000038] 03C3                   add               eax,ebx
  [0000003A] 5B                     pop               ebx
  [0000003B] C3                     ret
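
For anyone who wants to compile the C side on its own, here is a hedged, self-contained variant. The post does not show the table contents, so the a1000/a100/a10 values below are an assumption (digit value times 1000, 100, 10). The pointer casts are replaced by memcpy to stay clear of aliasing and alignment problems, and the tables have 16 entries so that a masked non-digit byte cannot index out of bounds.

```c
#include <stdint.h>
#include <string.h>

/* Assumed table contents: index = digit value 0..9. Entries 10..15 are
   zero-filled so that any byte masked with 0x0f stays inside the table. */
static const int a1000[16] = {0, 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000};
static const int a100[16]  = {0, 100, 200, 300, 400, 500, 600, 700, 800, 900};
static const int a10[16]   = {0, 10, 20, 30, 40, 50, 60, 70, 80, 90};

/* Convert exactly four ASCII digits to an integer with table lookups. */
static int str4int(const char *s)
{
    uint32_t packed;
    unsigned char d[4];

    memcpy(&packed, s, 4);       /* load the four characters as one dword */
    packed &= 0x0f0f0f0fu;       /* '0'..'9' (30h..39h) become 0..9 */
    memcpy(d, &packed, 4);       /* d[0] is the first character again */

    return a1000[d[0]] + a100[d[1]] + a10[d[2]] + d[3];
}
```

With these tables, str4int("1234") returns 1234. Because the bytes are copied in and out with memcpy, the byte order of the machine does not matter here.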

jj2007

New version, my algo and mineiro's are somewhat faster now:

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

554     cycles for 100 * C2D_J
1594    cycles for 100 * atodw
776     cycles for 100 * min2cvt
476     cycles for 100 * min1cvt (SSE3)
625     cycles for 100 * ConvertLUT
2273    cycles for 100 * FSM (Hutch)

545     cycles for 100 * C2D_J
1575    cycles for 100 * atodw
768     cycles for 100 * min2cvt
478     cycles for 100 * min1cvt (SSE3)
625     cycles for 100 * ConvertLUT
2252    cycles for 100 * FSM (Hutch)

548     cycles for 100 * C2D_J
1594    cycles for 100 * atodw
765     cycles for 100 * min2cvt
478     cycles for 100 * min1cvt (SSE3)
637     cycles for 100 * ConvertLUT
2262    cycles for 100 * FSM (Hutch)

54      bytes for C2D_J
10      bytes for atodw
98      bytes for min2cvt
50      bytes for min1cvt (SSE3)
118     bytes for ConvertLUT
8       bytes for FSM (Hutch)

1234    = eax C2D_J
1234    = eax atodw
1234    = eax min2cvt
1234    = eax min1cvt (SSE3)
1234    = eax ConvertLUT
1234    = eax FSM (Hutch)


@Timo: I've tried the MSVC optimised version, but it's 16% slower and gives a wrong result :sad:
See below C2D_J (or search for msvc inside the file).

mineiro

On my machine, "shr reg,16" performs better than "bswap", together with the corresponding changes to the code that follows it. The final gain is about 10 to 13 cycles.
ConvertLUT can be optimized a bit more by removing "push/pop ebx" and using the ecx register instead.
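
To illustrate the two ways of reaching the upper two digits, here is a rough C sketch. It only shows that both extractions give the same value; it says nothing about speed, and it is not the modified assembly code itself. The function names are invented for this example, and packed is assumed to hold the four masked digits with the first character in the low byte (a little-endian load).

```c
#include <stdint.h>

/* Portable equivalent of the bswap instruction: reverse the four bytes. */
static uint32_t swap_bytes(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000ff00u) |
           ((v << 8) & 0x00ff0000u) | (v << 24);
}

/* Variant 1: shift right by 16, like "shr reg,16".
   Afterwards the tens sit in the low byte and the units in the next byte. */
static int combine_with_shift(uint32_t packed)
{
    uint32_t high = packed >> 16;
    return (int)((packed & 0xff) * 1000 + ((packed >> 8) & 0xff) * 100 +
                 (high & 0xff) * 10 + ((high >> 8) & 0xff));
}

/* Variant 2: swap the bytes, like "bswap reg".
   Afterwards the units sit in the low byte and the tens in the next byte. */
static int combine_with_bswap(uint32_t packed)
{
    uint32_t swapped = swap_bytes(packed);
    return (int)((packed & 0xff) * 1000 + ((packed >> 8) & 0xff) * 100 +
                 ((swapped >> 8) & 0xff) * 10 + (swapped & 0xff));
}
```

For the string "1234", packed is 04030201h after masking, and both variants return 1234. The only difference is which byte ends up in the low position, so the instructions that follow have to swap their roles accordingly.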

These are the results of the latest benchmark test:
Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz (SSE4)

513 cycles for 100 * C2D_J
1475 cycles for 100 * atodw
687 cycles for 100 * min2cvt
463 cycles for 100 * min1cvt (SSE3)
640 cycles for 100 * ConvertLUT
1383 cycles for 100 * FSM (Hutch)

467 cycles for 100 * C2D_J
1477 cycles for 100 * atodw
696 cycles for 100 * min2cvt
456 cycles for 100 * min1cvt (SSE3)
639 cycles for 100 * ConvertLUT
1383 cycles for 100 * FSM (Hutch)

456 cycles for 100 * C2D_J
1474 cycles for 100 * atodw
702 cycles for 100 * min2cvt
503 cycles for 100 * min1cvt (SSE3)
494 cycles for 100 * ConvertLUT
1431 cycles for 100 * FSM (Hutch)

54 bytes for C2D_J
10 bytes for atodw
98 bytes for min2cvt
50 bytes for min1cvt (SSE3)
118 bytes for ConvertLUT
8 bytes for FSM (Hutch)

1234 = eax C2D_J
1234 = eax atodw
1234 = eax min2cvt
1234 = eax min1cvt (SSE3)
1234 = eax ConvertLUT
1234 = eax FSM (Hutch)
I'd rather be this ambulant metamorphosis than to have that old opinion about everything

daydreamer

Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz (SSE4)

453     cycles for 100 * C2D_J
1476    cycles for 100 * atodw
745     cycles for 100 * min2cvt
483     cycles for 100 * min1cvt (SSE3)
477     cycles for 100 * ConvertLUT
1397    cycles for 100 * FSM (Hutch)

457     cycles for 100 * C2D_J
1458    cycles for 100 * atodw
695     cycles for 100 * min2cvt
461     cycles for 100 * min1cvt (SSE3)
470     cycles for 100 * ConvertLUT
1366    cycles for 100 * FSM (Hutch)

450     cycles for 100 * C2D_J
1456    cycles for 100 * atodw
692     cycles for 100 * min2cvt
471     cycles for 100 * min1cvt (SSE3)
473     cycles for 100 * ConvertLUT
1374    cycles for 100 * FSM (Hutch)

54      bytes for C2D_J
10      bytes for atodw
98      bytes for min2cvt
50      bytes for min1cvt (SSE3)
118     bytes for ConvertLUT
8       bytes for FSM (Hutch)

1234    = eax C2D_J
1234    = eax atodw
1234    = eax min2cvt
1234    = eax min1cvt (SSE3)
1234    = eax ConvertLUT
1234    = eax FSM (Hutch)

-

@mineiro, about shr reg,16 vs bswap:
with your own SSE version, you should try whether a byte shuffle is faster.
My CPU is Haswell based, so it supports AVX2. Is there anyone besides Hutch who can test AVX2 code?
my none asm creations
https://masm32.com/board/index.php?topic=6937.msg74303#msg74303
I am an Invoker
"An Invoker is a mage who specializes in the manipulation of raw and elemental energies."
Like SIMD coding

TimoVJL

A test file built with MSVC 2019; it outputs 1234 as expected.
Just to see how C works with a LUT.

jj2007

Quote from: TimoVJL on November 14, 2021, 08:15:40 AM
A test file built with MSVC 2019; it outputs 1234 as expected.
Just to see how C works with a LUT.

For me it didn't work, the result is 1114 instead of 1234. Please search for msvc inside FourCharsToDword.asm, it should be line 74; change line 56 to if 0 to test the msvc code. Maybe I made an error when copying & pasting your code :sad:

hutch--

To appeal to your sense of humour, I split the FSM into 4 procedures, much smaller file, got it to produce the right numbers but it was about 30% slower than the single FSM procedure.  :undecided:

jj2007

Ok, here is version 5, with Timo's MSVC "optimised" algo. On my machine, mineiro's SSE3 algo is a clear winner:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

545     cycles for 100 * C2D_J
1587    cycles for 100 * atodw
769     cycles for 100 * min2cvt
503     cycles for 100 * min1cvt (SSE3)
626     cycles for 100 * ConvertLUT
2267    cycles for 100 * FSM (Hutch)
698     cycles for 100 * MSVC (Timo)

550     cycles for 100 * C2D_J
1593    cycles for 100 * atodw
768     cycles for 100 * min2cvt
477     cycles for 100 * min1cvt (SSE3)
624     cycles for 100 * ConvertLUT
2244    cycles for 100 * FSM (Hutch)
698     cycles for 100 * MSVC (Timo)

547     cycles for 100 * C2D_J
1589    cycles for 100 * atodw
768     cycles for 100 * min2cvt
483     cycles for 100 * min1cvt (SSE3)
624     cycles for 100 * ConvertLUT
2246    cycles for 100 * FSM (Hutch)
698     cycles for 100 * MSVC (Timo)

54      bytes for C2D_J
10      bytes for atodw
98      bytes for min2cvt
50      bytes for min1cvt (SSE3)
118     bytes for ConvertLUT
8       bytes for FSM (Hutch)
70      bytes for MSVC (Timo)

1234    = eax C2D_J
1234    = eax atodw
1234    = eax min2cvt
1234    = eax min1cvt (SSE3)
1234    = eax ConvertLUT
1234    = eax FSM (Hutch)
1234    = eax MSVC (Timo)

daydreamer

Mineiro's is also my favourite, but for ASCII to double conversion I think Raymond's approach would be more versatile: it takes care of "1234.5678", but also "123.45678" and "12345.678".
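
One possible way to handle a decimal point at any position, sketched in C. This is not Raymond's code, which is not shown in this thread; it is a generic approach that collects all digits into an integer, counts the digits after the point, and divides once at the end. The name digits_to_double is made up, and the sketch assumes unsigned input with no exponent and no more than about 15 digits in total.

```c
#include <stdint.h>

/* Convert an unsigned decimal string such as "1234.5678" to a double.
   Assumptions: no sign, no exponent, at most about 15 digits in total. */
static double digits_to_double(const char *s)
{
    uint64_t mantissa = 0;   /* all digits, with the point ignored */
    uint64_t scale = 1;      /* 10 to the power of the digits after the point */
    int seen_point = 0;

    for (; *s != '\0'; s++) {
        if (*s == '.' && !seen_point) {
            seen_point = 1;
            continue;
        }
        if (*s < '0' || *s > '9')
            break;                       /* stop at the first non-digit */
        mantissa = mantissa * 10 + (uint64_t)(*s - '0');
        if (seen_point)
            scale *= 10;
    }
    return (double)mantissa / (double)scale;
}
```

All three strings "1234.5678", "123.45678" and "12345.678" produce the same mantissa, 12345678; only the scale differs (10000, 100000 and 1000).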

jj2007

Quote from: daydreamer on November 15, 2021, 12:23:28 AM
for ascii to double conversion, I think Raymond approach would be more versatile take care of "1234.5678", but also "123.45678" and "12345.678"

Really? How exactly would you do that?

ConvertLUT:
  mov edx, [eax] ; eax = pointer to string, e.g. "1234"; load its four chars
  and edx, 0f0f0f0fh ; convert "1" to 1
  movzx ecx, dl ;1000s
  mov eax, t1000[ecx*4]
  movzx ecx,dh ;100s
  add eax, t100[ecx*4]
  bswap edx ; reverse byte order: 10s now in dh, units in dl
  movzx ecx,dh ;10s
  add eax, t10[ecx*4]
  movzx ecx, dl ;units
  add eax, ecx
  retn

TimoVJL

Sadly, the test program doesn't work on the AMD Athlon(tm) II X2 220 Processor (SSE3):
AMD Athlon(tm) II X2 220 Processor (SSE3)

909     cycles for 100 * C2D_J
2480    cycles for 100 * atodw
1301    cycles for 100 * min2cvt

FORTRANS

Intel(R) Core(TM) i3-4005U CPU @ 1.70GHz (SSE4)

556 cycles for 100 * C2D_J
1699 cycles for 100 * atodw
847 cycles for 100 * min2cvt
553 cycles for 100 * min1cvt (SSE3)
588 cycles for 100 * ConvertLUT
1597 cycles for 100 * FSM (Hutch)

552 cycles for 100 * C2D_J
1701 cycles for 100 * atodw
847 cycles for 100 * min2cvt
557 cycles for 100 * min1cvt (SSE3)
586 cycles for 100 * ConvertLUT
1596 cycles for 100 * FSM (Hutch)

554 cycles for 100 * C2D_J
1701 cycles for 100 * atodw
846 cycles for 100 * min2cvt
565 cycles for 100 * min1cvt (SSE3)
588 cycles for 100 * ConvertLUT
1587 cycles for 100 * FSM (Hutch)

54 bytes for C2D_J
10 bytes for atodw
98 bytes for min2cvt
50 bytes for min1cvt (SSE3)
118 bytes for ConvertLUT
8 bytes for FSM (Hutch)

1234 = eax C2D_J
1234 = eax atodw
1234 = eax min2cvt
1234 = eax min1cvt (SSE3)
1234 = eax ConvertLUT
1234 = eax FSM (Hutch)

--- ok ---

Intel(R) Core(TM) i3-10110U CPU @ 2.10GHz (SSE4)

348 cycles for 100 * C2D_J
1080 cycles for 100 * atodw
501 cycles for 100 * min2cvt
323 cycles for 100 * min1cvt (SSE3)
359 cycles for 100 * ConvertLUT
1019 cycles for 100 * FSM (Hutch)

338 cycles for 100 * C2D_J
1051 cycles for 100 * atodw
509 cycles for 100 * min2cvt
366 cycles for 100 * min1cvt (SSE3)
385 cycles for 100 * ConvertLUT
1015 cycles for 100 * FSM (Hutch)

330 cycles for 100 * C2D_J
1119 cycles for 100 * atodw
516 cycles for 100 * min2cvt
326 cycles for 100 * min1cvt (SSE3)
376 cycles for 100 * ConvertLUT
1041 cycles for 100 * FSM (Hutch)

54 bytes for C2D_J
10 bytes for atodw
98 bytes for min2cvt
50 bytes for min1cvt (SSE3)
118 bytes for ConvertLUT
8 bytes for FSM (Hutch)

1234 = eax C2D_J
1234 = eax atodw
1234 = eax min2cvt
1234 = eax min1cvt (SSE3)
1234 = eax ConvertLUT
1234 = eax FSM (Hutch)

--- ok ---