News:

Masm32 SDK description, downloads and other helpful links
Message to All Guests

Main Menu

ASM for FUN NEW step #1

Started by frktons, November 09, 2021, 09:04:58 AM

Previous topic - Next topic

raymond

I would think that using Look-up tables for converting numbers from ascii format to binary may be the fastest option. I don't have the facilities to compare various procedures but someone else may want to test it.
I wrote some code to prepare such an LUT (once which should take only a few nanosecs) then used it to convert an ascii number 1,000,000,000 times (10^9). The timing on my older laptop was 1.5 secs!

Code for preparing LUT
.data
      t10   dd    10 DUP(0)
      t100  dd    10 DUP(0)
      t1000 dd    10 DUP(0)

      numbers     db    "3578"

start:

;fill LUTs of multiples
;of 10
      lea   edi,t10
      xor   eax,eax
      mov   edx,10
      mov   ecx,10
   @@:
      stosd
      add   eax,edx
      dec   ecx
      jnz   @B

;of 100
      xor   eax,eax
      mov   edx,100
      mov   ecx,10
   @@:
      stosd
      add   eax,edx
      dec   ecx
      jnz   @B

;of 1000
      xor   eax,eax
      mov   edx,1000
      mov   ecx,10
   @@:
      stosd
      add   eax,edx
      dec   ecx
      jnz   @B

;there is NO need to have a look-up table for multiples of units!!!


Then I ran the following code for converting ascii to binary, using the QueryPerformanceFrequency function for the timing

      mov   ecx,1000000000
start1:

      lea   esi,numbers
      mov   edx,[esi]
      and   edx,0f0f0f0fh
      movzx ebx,dl            ;1000s
      mov   eax,t1000[ebx*4]
      movzx ebx,dh            ;100s
      add   eax,t100[ebx*4]
      bswap edx
      movzx ebx,dh            ;10s
      add   eax,t10[ebx*4]
      movzx ebx,dl            ;units
      add   eax,ebx

      dec   ecx
      jnz   start1
Whenever you assume something, you risk being wrong half the time.
http://www.ray.masmcode.com

TimoVJL

AMD Athlon(tm) II X2 220 Processor (SSE3)

833     cycles for 100 * C2D_J
2410    cycles for 100 * atodw
34220   cycles for 100 * sscanf
1296    cycles for 100 * min2cvt
1057    cycles for 100 * min1cvt

806     cycles for 100 * C2D_J
2437    cycles for 100 * atodw
34498   cycles for 100 * sscanf
1297    cycles for 100 * min2cvt
1053    cycles for 100 * min1cvt

805     cycles for 100 * C2D_J
2436    cycles for 100 * atodw
34558   cycles for 100 * sscanf
1297    cycles for 100 * min2cvt
1055    cycles for 100 * min1cvt

805     cycles for 100 * C2D_J
2410    cycles for 100 * atodw
34503   cycles for 100 * sscanf
1296    cycles for 100 * min2cvt
1060    cycles for 100 * min1cvt

50      bytes for C2D_J
10      bytes for atodw
22      bytes for sscanf
98      bytes for min2cvt
62      bytes for min1cvt

1234    = eax C2D_J
1234    = eax atodw
1234    = eax sscanf
1234    = eax min2cvt
1234    = eax min1cvt

--- ok ---
LUT with Pelles C1: #include <stdio.h>
2: int a10[] = {0,10,20,30,40,50,60,70,80,90};
3: int a100[] = {0,100,200,300,400,500,600,700,800,900};
4: int a1000[] = {0,1000,2000,3000,4000,5000,6000,7000,8000,9000};
5:
6: int __cdecl main(void)

_main:
  [00000000] 55                     push              ebp
  [00000001] 89E5                   mov               ebp,esp
  [00000003] 83EC08                 sub               esp,8
7: {
8: char anum[] = "1234";
  [00000006] C745FB31323334         mov               dword ptr [ebp-5],34333231
  [0000000D] C645FF00               mov               byte ptr [ebp-1],0
9: int num;
10:
11: (*(int*)&anum) &= 0x0f0f0f0f;
  [00000011] 8165FB0F0F0F0F         and               dword ptr [ebp-5],F0F0F0F
12: num = a1000[anum[0]] + a100[anum[1]] + a10[anum[2]] + anum[3];
  [00000018] 0FBE45FB               movsx             eax,byte ptr [ebp-5]
  [0000001C] 8B048500000000         mov               eax,dword ptr [eax*4+_a1000]
  [00000023] 0FBE55FC               movsx             edx,byte ptr [ebp-4]
  [00000027] 03049500000000         add               eax,dword ptr [edx*4+_a100]
  [0000002E] 0FBE55FD               movsx             edx,byte ptr [ebp-3]
  [00000032] 03049500000000         add               eax,dword ptr [edx*4+_a10]
  [00000039] 0FBE55FE               movsx             edx,byte ptr [ebp-2]
  [0000003D] 01D0                   add               eax,edx
13: printf("%u\n", num);
  [0000003F] 50                     push              eax
  [00000040] 6800000000             push              @6
  [00000045] E800000000             call              _printf
  [0000004A] 83C408                 add               esp,8
14: return 0;
  [0000004D] 31C0                   xor               eax,eax
  [0000004F] 89EC                   mov               esp,ebp
  [00000051] 5D                     pop               ebp
  [00000052] C3                     ret               
using temporary string
char anum2[] = "1234";
char anum[5];
int num;

12: (*(int*)&anum) = (*(int*)&anum2) & 0x0f0f0f0f;
  [00000011] 8B45FB                 mov               eax,dword ptr [ebp-5]
  [00000014] 250F0F0F0F             and               eax,F0F0F0F
  [00000019] 8945F6                 mov               dword ptr [ebp-A],eax
May the source be with you

jj2007

Quote from: raymond on November 11, 2021, 03:18:06 PM
I would think that using Look-up tables for converting numbers from ascii format to binary may be the fastest option. I don't have the facilities to compare various procedures but someone else may want to test it.

Thank you, Raymond. Here it is as ConvertLUT:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

475     cycles for 100 * C2D_J
1597    cycles for 100 * atodw
22224   cycles for 100 * sscanf
772     cycles for 100 * min2cvt
555     cycles for 100 * min1cvt
640     cycles for 100 * ConvertLUT

497     cycles for 100 * C2D_J
1595    cycles for 100 * atodw
22401   cycles for 100 * sscanf
800     cycles for 100 * min2cvt
584     cycles for 100 * min1cvt
631     cycles for 100 * ConvertLUT

489     cycles for 100 * C2D_J
1598    cycles for 100 * atodw
22796   cycles for 100 * sscanf
776     cycles for 100 * min2cvt
551     cycles for 100 * min1cvt
640     cycles for 100 * ConvertLUT

50      bytes for C2D_J
10      bytes for atodw
22      bytes for sscanf
98      bytes for min2cvt
62      bytes for min1cvt
118     bytes for ConvertLUT

1234    = eax C2D_J
1234    = eax atodw
1234    = eax sscanf
1234    = eax min2cvt
1234    = eax min1cvt
1234    = eax ConvertLUT

mineiro

Good job sir raymond and sir TimoVJL;
I created a static aligned LUT table to your procedures in data section.
I passed data pointer by stack, so, inserted a few modifications to deal with this in both codes.

raymond; when I changed bswap in your code by shr and respective changes it performed a bit better in my machine.
I'm leaving original code in testcase, only changed ebx to ecx register usage to follow register preservations without need of push/pop ebx.

TimoVJL; your code suffered performance because value being passed by stack instead of data pointer being passed by stack. I know this can be done inside our code, but just to follow rules. I have tried to insert same overhead in all procedures so result should be honest.
After "and 0f0f0f0fh" I inserted value back to stack. So, I confess it's not the best try because data is in register and go back to stack and read stack data digits each time. If you know better modifications please tell.

I have tried to not touch in yours code, only adapt to windows rules.

sir jj2007; in your testcase some procedures are getting data pointer from register while other procedures are receiving that from stack. This is creating different overheads. In first case it's necessary "mov reg,[reg]" while in other "mov reg,[esp+4] mov reg,[reg]", and at end of procedure first procedure do only "retn" while other "ret 4". Suggestions?


wine FourCharsToDword.exe
Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz (SSE4)

562 cycles for 100 * C2D_J
1482 cycles for 100 * atodw
12649 cycles for 100 * sscanf
752 cycles for 100 * min2cvt
477 cycles for 100 * min1cvt
471 cycles for 100 * ray_LUT
741 cycles for 100 * TimoVJL_PellesC_LUT

638 cycles for 100 * C2D_J
1480 cycles for 100 * atodw
12835 cycles for 100 * sscanf
756 cycles for 100 * min2cvt
433 cycles for 100 * min1cvt
476 cycles for 100 * ray_LUT
744 cycles for 100 * TimoVJL_PellesC_LUT

564 cycles for 100 * C2D_J
1505 cycles for 100 * atodw
12630 cycles for 100 * sscanf
778 cycles for 100 * min2cvt
427 cycles for 100 * min1cvt
470 cycles for 100 * ray_LUT
729 cycles for 100 * TimoVJL_PellesC_LUT

561 cycles for 100 * C2D_J
1478 cycles for 100 * atodw
12558 cycles for 100 * sscanf
751 cycles for 100 * min2cvt
432 cycles for 100 * min1cvt
465 cycles for 100 * ray_LUT
730 cycles for 100 * TimoVJL_PellesC_LUT

54 bytes for C2D_J
10 bytes for atodw
22 bytes for sscanf
84 bytes for min2cvt
52 bytes for min1cvt
62 bytes for ray_LUT
70 bytes for TimoVJL_PellesC_LUT

1234 = eax C2D_J
1234 = eax atodw
1234 = eax sscanf
1234 = eax min2cvt
1234 = eax min1cvt
1234 = eax ray_LUT
1234 = eax TimoVJL_PellesC_LUT


PS: sir jj2007 told me in pm that my executables are not running from GUI, so, you must open console to run; or if someone can create an executable direct using windows just post that. I'm doing these thing using linux. Appreciate.
I'd rather be this ambulant metamorphosis than to have that old opinion about everything

mineiro

These are results from jj2007 testcase:

wine FourCharsToDword.exe
Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz (SSE4)

479 cycles for 100 * C2D_J
1528 cycles for 100 * atodw
12745 cycles for 100 * sscanf
725 cycles for 100 * min2cvt
493 cycles for 100 * min1cvt
502 cycles for 100 * ConvertLUT

478 cycles for 100 * C2D_J
1610 cycles for 100 * atodw
12623 cycles for 100 * sscanf
745 cycles for 100 * min2cvt
495 cycles for 100 * min1cvt
501 cycles for 100 * ConvertLUT

550 cycles for 100 * C2D_J
1483 cycles for 100 * atodw
12590 cycles for 100 * sscanf
741 cycles for 100 * min2cvt
581 cycles for 100 * min1cvt
500 cycles for 100 * ConvertLUT

50 bytes for C2D_J
10 bytes for atodw
22 bytes for sscanf
98 bytes for min2cvt
62 bytes for min1cvt
118 bytes for ConvertLUT

1234 = eax C2D_J
1234 = eax atodw
1234 = eax sscanf
1234 = eax min2cvt
1234 = eax min1cvt
1234 = eax ConvertLUT

I'd rather be this ambulant metamorphosis than to have that old opinion about everything

mineiro

Thanks for the tip sir jj2007;
I checked command line when building executable and subsystem:WINDOWS was default; I changed that to CONSOLE.
Link version that I was using is v8.???, changed that to Version 5.12.8078.
So, I suppose now can work by double clicking.
I'd rather be this ambulant metamorphosis than to have that old opinion about everything

jj2007

Thanks, mineiro - it works, and now we have a clear winner, at least on my cpu:

min1cvt proc _data:dword
option prologue:none
option epilogue:none

mov ecx,[esp+4]
mov eax,[ecx]
xor eax,"0000"      ;and eax,0f0f0f0fh
movd xmm0,eax       ;01020304h
punpcklbw xmm0,xmm2 ;0001000200030004h
pmaddwd xmm0,xmm1   ;mul and add
;---------------------------
;db 66h,0fh,38h,02h,0c0h ;          phaddd xmm0,xmm0 ;SSE3
;movd eax,xmm0

movd eax,xmm0
psrlq xmm0,32
movd ecx,xmm0
add eax,ecx
ret 4
min1cvt endp


Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

655     cycles for 100 * C2D_J
1605    cycles for 100 * atodw
22178   cycles for 100 * sscanf
785     cycles for 100 * min2cvt
475     cycles for 100 * min1cvt
599     cycles for 100 * ray_LUT
661     cycles for 100 * TimoVJL_PellesC_LUT

651     cycles for 100 * C2D_J
1599    cycles for 100 * atodw
22808   cycles for 100 * sscanf
823     cycles for 100 * min2cvt
508     cycles for 100 * min1cvt
607     cycles for 100 * ray_LUT
717     cycles for 100 * TimoVJL_PellesC_LUT

714     cycles for 100 * C2D_J
1615    cycles for 100 * atodw
22657   cycles for 100 * sscanf
778     cycles for 100 * min2cvt
488     cycles for 100 * min1cvt
587     cycles for 100 * ray_LUT
657     cycles for 100 * TimoVJL_PellesC_LUT

696     cycles for 100 * C2D_J
1613    cycles for 100 * atodw
22193   cycles for 100 * sscanf
778     cycles for 100 * min2cvt
487     cycles for 100 * min1cvt
595     cycles for 100 * ray_LUT
664     cycles for 100 * TimoVJL_PellesC_LUT

54      bytes for C2D_J
10      bytes for atodw
22      bytes for sscanf
84      bytes for min2cvt
52      bytes for min1cvt
62      bytes for ray_LUT
70      bytes for TimoVJL_PellesC_LUT

1234    = eax C2D_J
1234    = eax atodw
1234    = eax sscanf
1234    = eax min2cvt
1234    = eax min1cvt
1234    = eax ray_LUT
1234    = eax TimoVJL_PellesC_LUT


It's a pity that this is limited to the exotic test case "four digits" :sad:

daydreamer

Quote from: jj2007 on November 12, 2021, 01:03:01 AM
It's a pity that this is limited to the exotic test case "four digits" :sad:
its not,thanks everyone for testing different approaches to fast ascii->integer :thumbsup:
there is still fastest ascii->double and fastest ascii->64bit integer
my none asm creations
https://masm32.com/board/index.php?topic=6937.msg74303#msg74303
I am an Invoker
"An Invoker is a mage who specializes in the manipulation of raw and elemental energies."
Like SIMD coding

mineiro

Quote from: jj2007 on November 12, 2021, 01:03:01 AM
Thanks, mineiro - it works, and now we have a clear winner, at least on my cpu:
Not really, I have absolute sure that this code can be defeat by others, and will perform different in others cpus. This is what I learn in this board.

I forgot how to activate xmm, do you remember? That code is being translated to mmx.
I'm adding a "66h byte" as preffix in instructions in other tests to work with xmm, but I remember that exist other way to do this. Maybe macros!?
db 66h
movd xmm0,eax       ;01020304h
db 66h
punpcklbw xmm0,xmm2 ;0001000200030004h
...
I'd rather be this ambulant metamorphosis than to have that old opinion about everything

daydreamer

Quote from: mineiro on November 12, 2021, 06:13:25 AM
Quote from: jj2007 on November 12, 2021, 01:03:01 AM
Thanks, mineiro - it works, and now we have a clear winner, at least on my cpu:
Not really, I have absolute sure that this code can be defeat by others, and will perform different in others cpus. This is what I learn in this board.

I forgot how to activate xmm, do you remember? That code is being translated to mmx.
I'm adding a "66h byte" as preffix in instructions in other tests to work with xmm, but I remember that exist other way to do this. Maybe macros!?
db 66h
movd xmm0,eax       ;01020304h
db 66h
punpcklbw xmm0,xmm2 ;0001000200030004h
...
change to newer assembler than ml 6.14 solves it without need for macros

my none asm creations
https://masm32.com/board/index.php?topic=6937.msg74303#msg74303
I am an Invoker
"An Invoker is a mage who specializes in the manipulation of raw and elemental energies."
Like SIMD coding

mineiro

Quote from: daydreamer on November 12, 2021, 06:54:11 AM
change to newer assembler than ml 6.14 solves it without need for macros
Thanks sir daydreamer.
I'd rather be this ambulant metamorphosis than to have that old opinion about everything

frktons

That's great Guys.

I am happy to see many options are coming out and the speed of the code is quite nice.  :thumbsup:

Compliments to everyone partecipating in the game.  :eusa_clap:
There are only two days a year when you can't do anything: one is called yesterday, the other is called tomorrow, so today is the right day to love, believe, do and, above all, live.

Dalai Lama

hutch--

Here is another approach, a finite state machine, AKA look up tree. 4 character strings fed in, integers fed out. Its big but its reasonably fast. Timings below are with my old i7.

Output

2
9999
5432
8619
9876
2345
1234
7776
8193
4352
test return values - done

timing 10 million matches next, press any key ....

ms = 109 duration
Press any key to continue ...

jj2007

Finite state machine, AKA look up tree vs my algo. I hope I didn't make any mistakes in translation. The attachment does not contain the num.asm file - see Hutch' post.

hutch--

 :thumbsup:

That is genuinely fast and it looks like its producing the right output numbers.