The MASM Forum

General => The Campus => Topic started by: frktons on November 09, 2021, 09:04:58 AM

Title: ASM for FUN NEW step #1
Post by: frktons on November 09, 2021, 09:04:58 AM
Hello.

I have just created a starting point for validating the ALGO to convert
some millions strings into numbers.

Here is the example I used in PowerBasic, that is considered quite a good Compiler,
and the results I got. From here the ASM versions should start and confront the speed
against it.

The Code I used, quite simple:

QuoteSUB Convert
   
    TIX CycleCount
   
    FOR Counter = 1 TO Indice
   
       MyNumbers(Counter) = VAL(MyStrings(Counter))
   
    NEXT
   
    TIX END CycleCount
   
END SUB   

The strings are 7,110,268 in the string array.
The CPU tix used: 2,607,804,194

Let's see how we can speed it up with some ASM code.  :eusa_dance:

The file that I use for the test is quite big > 300 MB. They are real data from
S&P 500 quotations from 1997 up to now with one minute lapse.

Title: Re: ASM for FUN NEW step #1
Post by: jj2007 on November 09, 2021, 10:53:52 AM
Try the sval("123") macro, it's part of masm32rt.inc:

include \masm32\include\masm32rt.inc

.data
x123 db "-12345", 0

.code
start:
  MsgBox 0, cat$(str$(sval(offset x123)), " is the value"), "val is great:", MB_OK
  exit

end start


MasmBasic Val(a string) (https://www.jj2007.eu/MasmBasicQuickReference.htm#Mb1201) is probably a bit slower (it's an allrounder that reads dec, hex, bin in various flavours).
Title: Re: ASM for FUN NEW step #1
Post by: mineiro on November 09, 2021, 01:11:54 PM
This algo only deals with non signed numbers.

;first approach
;"1234" == 1*10^3 + 2*10^2 + 3*10^1 + 4*10^0 == 1000+200+30+4
mov eax,"1234"          ;30303030h to 39393939h
xor eax,"0000"          ;sub "0000"
xor ecx,ecx
xor edx,edx
xor ebx,ebx

shld ecx,eax,8          ;"1"
shl eax,8               ;"234"
lea ebx,[ecx*8]         ;1*8=8
lea ebx,[ebx+ecx*2]     ;8+(1*2)=10

xor ecx,ecx
shld ecx,eax,8          ;"2"
shl eax,8               ;"34"
add ebx,ecx             ;10+2=12
lea edx,[ebx*8]         ;12*8=96
lea edx,[edx+ebx*2]     ;96+(12*2)=120

xor ecx,ecx
xor ebx,ebx
shld ecx,eax,8          ;"3"
shl eax,8               ;"4"
add edx,ecx             ;120+3=123
lea ebx,[edx*8]         ;123*8=984
lea ebx,[ebx+edx*2]     ;984+(123*2)=1230

xor ecx,ecx
shld ecx,eax,8          ;"4"
add ebx,ecx             ;1230+4=1234
;ebx="1234" converted to binary equals to 4d2h

You can create a lookup table with all possible combinations from 0000 to 9999 instead of do calculus of each number each time. Will have a execution time gain in 1 million attempts.
Title: Re: ASM for FUN NEW step #1
Post by: frktons on November 09, 2021, 01:59:03 PM
Quote from: jj2007 on November 09, 2021, 10:53:52 AM
Try the sval("123") macro, it's part of masm32rt.inc:

include \masm32\include\masm32rt.inc

.data
x123 db "-12345", 0

.code
start:
  MsgBox 0, cat$(str$(sval(offset x123)), " is the value"), "val is great:", MB_OK
  exit

end start


MasmBasic Val(a string) (https://www.jj2007.eu/MasmBasicQuickReference.htm#Mb1201) is probably a bit slower (it's an allrounder that reads dec, hex, bin in various flavours).

For didactical purpose I'd like to use just pure and simple ASM opcodes, registers, jumps, moves, adds and so on.
Using a MACRO or a FUNCTION that somebody else has assembled can just be done to test the efficiency and speed
but it's not fully satisfying for my thirst of ASM knowledge  :-)

Moreover I can't find the sval() macro in MASM32 help, so I can't see what is it and how it works.

Thanks anyway JJ for your suggestions. :-)

Title: Re: ASM for FUN NEW step #1
Post by: frktons on November 09, 2021, 02:00:50 PM
Quote from: mineiro on November 09, 2021, 01:11:54 PM
This algo only deals with non signed numbers.

;first approach
;"1234" == 1*10^3 + 2*10^2 + 3*10^1 + 4*10^0 == 1000+200+30+4
mov eax,"1234"          ;30303030h to 39393939h
xor eax,"0000"          ;sub "0000"
xor ecx,ecx
xor edx,edx
xor ebx,ebx

shld ecx,eax,8          ;"1"
shl eax,8               ;"234"
lea ebx,[ecx*8]         ;1*8=8
lea ebx,[ebx+ecx*2]     ;8+(1*2)=10

xor ecx,ecx
shld ecx,eax,8          ;"2"
shl eax,8               ;"34"
add ebx,ecx             ;10+2=12
lea edx,[ebx*8]         ;12*8=96
lea edx,[edx+ebx*2]     ;96+(12*2)=120

xor ecx,ecx
xor ebx,ebx
shld ecx,eax,8          ;"3"
shl eax,8               ;"4"
add edx,ecx             ;120+3=123
lea ebx,[edx*8]         ;123*8=984
lea ebx,[ebx+edx*2]     ;984+(123*2)=1230

xor ecx,ecx
shld ecx,eax,8          ;"4"
add ebx,ecx             ;1230+4=1234
;ebx="1234" converted to binary equals to 4d2h

You can create a lookup table with all possible combinations from 0000 to 9999 instead of do calculus of each number each time. Will have a execution time gain in 1 million attempts.

Yes Mineiro, this is the kind of reasoning I'm doing myself.
Just need time to regain my ability to use the MASM opcodes.

My data set has not negative numbers, all of them are not signed.

Thanks
Title: Re: ASM for FUN NEW step #1
Post by: frktons on November 09, 2021, 02:24:00 PM
I am thinking about an ASCII approach to the conversion from string to number.
If I have the string number "4735", every digit has an ASCII representation, and value:

4 is ASCII 52
7 is ASCII 55
3 is ASCII 51
5 is ASCII 53

The four bytes in memory have these ASCII codes 52,55,51,53.

If I move these 4 bytes into four registers: EAX, EBX, ECX, EDX
and subtract from each the value 48, I have the value

4 in EAX
7 in EBX
3 in ECX
5 in EDX

EAX contains the thousands (4) so it's value has to be multiplied for 1000
EBX contains the hundreds (7) so it's value has to be multiplied for 100
ECX contains the tens (3) so it's value  has to be multiplied for 10
EDX contains the units

the sum of the contents of the 4 registers is the number that is represented in the ASCII string.

I don't know if this approach is efficient for speed, but the logic should work.

If there are MMX or SSE opcodes that do in parallel the work on all the four bytes/registers the
efficiency should be better. But I am not aware at the moment if these opcodes exist.



Title: Re: ASM for FUN NEW step #1
Post by: daydreamer on November 09, 2021, 09:41:54 PM
Quote from: frktons on November 09, 2021, 02:24:00 PM
I am thinking about an ASCII approach to the conversion from string to number.
If I have the string number "4735", every digit has an ASCII representation, and value:

4 is ASCII 52
7 is ASCII 55
3 is ASCII 51
5 is ASCII 53

The four bytes in memory have these ASCII codes 52,55,51,53.

If I move these 4 bytes into four registers: EAX, EBX, ECX, EDX
and subtract from each the value 48, I have the value

4 in EAX
7 in EBX
3 in ECX
5 in EDX

EAX contains the thousands (4) so it's value has to be multiplied for 1000
EBX contains the hundreds (7) so it's value has to be multiplied for 100
ECX contains the tens (3) so it's value  has to be multiplied for 10
EDX contains the units

the sum of the contents of the 4 registers is the number that is represented in the ASCII string.

I don't know if this approach is efficient for speed, but the logic should work.

If there are MMX or SSE opcodes that do in parallel the work on all the four bytes/registers the
efficiency should be better. But I am not aware at the moment if these opcodes exist.
you can get ascii numbers being " 123" too instead of "0123",so I prefer AND to mask out 32+16 instead of sub

Title: Re: ASM for FUN NEW step #1
Post by: frktons on November 09, 2021, 09:48:13 PM
Yes daydreamer, I think it is an efficient way of getting the same result.

I am not used to using these tricks, and my ASM is a bit rusty from long inactivity.

But I understand the point. Thanks

In the data set I am using there are no spaces, but leading "0" if the case.
There are no cases of leading zero, all the values are between 1000 and 5000 indeed.

:thumbsup:
Title: Re: ASM for FUN NEW step #1
Post by: jj2007 on November 09, 2021, 10:13:07 PM
Quote from: frktons on November 09, 2021, 02:24:00 PMEAX contains the thousands (4) so it's value has to be multiplied for 1000
EBX contains the hundreds (7) so it's value has to be multiplied for 100
ECX contains the tens (3) so it's value  has to be multiplied for 10
EDX contains the units

include \masm32\include\masm32rt.inc

.data
x123 db "1234", 0

.code
start:
  mov ecx, dword ptr x123
  movzx eax, cl ; "1"
  and eax, 15 ; 1
  imul eax, eax, 1000 ; 1000
  movzx edx, ch ; "2"
  and edx, 15 ; 2
  imul edx, edx, 100 ; 200
  add eax, edx ; 1200
  bswap ecx
  movzx edx, ch ; "3"
  and edx, 15 ; 3
  imul edx, edx, 10 ; 30
  add eax, edx ; 1230
  and ecx, 15 ; "4" -> 4
  add eax, ecx ; 1234
  MsgBox 0, cat$(str$(eax), " is the value"), "That was simple:", MB_OK
  exit
end start
Title: Re: ASM for FUN NEW step #1
Post by: frktons on November 09, 2021, 10:18:52 PM
Hi JJ.

Nice solution using only one register and its one byte segments.

Great idea, compliment my friend.

I was thinking about something similar, but I was wondering if Full 32 bit registers are faster
doing these operations. Maybe not. Need to check it out.

:thumbsup:
Title: Re: ASM for FUN NEW step #1
Post by: hutch-- on November 09, 2021, 11:32:05 PM
What about a lookup table with 0000 to 9999 members ? Only 40k.
Title: Re: ASM for FUN NEW step #1
Post by: mineiro on November 10, 2021, 01:15:12 AM
This is a variation from daydreamer code.

.data
align 8
total dw 1,10,100,1000
.code
movd xmm1,qword ptr [total]
pxor xmm2,xmm2

mov eax,"1234"      ;mov eax,dword ptr []
                    ;bswap eax
xor eax,"0000"      ;and eax,0f0f0f0fh
movd xmm0,eax       ;01020304h
punpcklbw xmm0,xmm2 ;0001000200030004h
pmaddwd xmm0,xmm1   ;mul and add
phaddd xmm0,xmm0    ;add
movd eax,xmm0       ;eax=4d2h
Title: Re: ASM for FUN NEW step #1
Post by: jj2007 on November 10, 2021, 02:33:45 AM
Quote from: hutch-- on November 09, 2021, 11:32:05 PM
What about a lookup table with 0000 to 9999 members ? Only 40k.

How would you look up, for example, "1234"? Conversion is from Ascii to binary...
Title: Re: ASM for FUN NEW step #1
Post by: hutch-- on November 10, 2021, 06:32:03 AM
Simple, make a dedicated integer conversion that dealt exactly with 4 characters then use it as an index for the array. What is in the array ? Anything you like.
Title: Re: ASM for FUN NEW step #1
Post by: daydreamer on November 10, 2021, 06:44:17 AM
Quote from: mineiro on November 10, 2021, 01:15:12 AM
This is a variation from daydreamer code.

.data
align 8
total dw 1,10,100,1000
.code
movd xmm1,qword ptr [total]
pxor xmm2,xmm2

mov eax,"1234"      ;mov eax,dword ptr []
                    ;bswap eax
xor eax,"0000"      ;and eax,0f0f0f0fh
movd xmm0,eax       ;01020304h
punpcklbw xmm0,xmm2 ;0001000200030004h
pmaddwd xmm0,xmm1   ;mul and add
phaddd xmm0,xmm0    ;add
movd eax,xmm0       ;eax=4d2h

great :thumbsup:
actually my code was just first to get it working right,after that make ascii->double and double->ascii and possible 128bit and 256bit
@Hutch
LUT works fast,how fast if there is a prime among first 1-65535 LUT used code somewhere in forum

Title: Re: ASM for FUN NEW step #1
Post by: jj2007 on November 10, 2021, 07:59:05 AM
Quote from: hutch-- on November 10, 2021, 06:32:03 AM
Simple, make a dedicated integer conversion that dealt exactly with 4 characters then use it as an index for the array. What is in the array ? Anything you like.

This "dedicated integer conversion" looks suspiciously like the method I used in reply #8. Only that no array is needed after you got the index...

Let's do some timings, just for fun?  :biggrin:

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

469     cycles for 100 * C2D_J
1594    cycles for 100 * atodw
21894   cycles for 100 * sscanf

485     cycles for 100 * C2D_J
1582    cycles for 100 * atodw
22022   cycles for 100 * sscanf

469     cycles for 100 * C2D_J
1590    cycles for 100 * atodw
21863   cycles for 100 * sscanf

476     cycles for 100 * C2D_J
1573    cycles for 100 * atodw
21894   cycles for 100 * sscanf

50      bytes for C2D_J
10      bytes for atodw
22      bytes for sscanf

1234    = eax C2D_J
1234    = eax atodw
1234    = eax sscanf
Title: Re: ASM for FUN NEW step #1
Post by: mineiro on November 10, 2021, 09:06:41 AM
Quote from: jj2007 on November 10, 2021, 07:59:05 AM
This "dedicated integer conversion" looks suspiciously like the method I used in reply #8. Only that no array is needed after you got the index...
I was thinking something like this:

lea esi,lookup_table
mov edx,0f0f0f0fh       ;mask, lower bits of each byte

mov ecx,"1234"          ;source
pext eax,ecx,edx        ;index=eax=00001234h, 64k lookup table *2 (element sizeof)    ;BMI2 cpuid
movzx ebx,word ptr [esi+eax*2]      ;ebx=converted value
Title: Re: ASM for FUN NEW step #1
Post by: frktons on November 10, 2021, 09:22:36 AM
I tried a similar solution like Steve suggested, but in a different logic and shape.
I used 4 X 64 elements array, and at the elements with the index = ASCII code (for "1" that is 49)
I initialized the 49th(ASCII code for "1") element of Thousand array with the value 1000 and did the same
for others value up to "9".
This array was for Thousands, the second array was for Hundreds, and The Tens, and Units.

The conversion from "1234" to its corresponding number is just a SUM Thousands(ASCII code first digit) +
SUM Hundreds(ASCII code second digit) + SUM Tens(ASCII code third digit) + SUM Units(ASCII code fourth digit).

I tried it in PowerBasic, and it is 30% faster than the standard VAL() function.

I suppose in ASM it can be much faster because we can use registers, shifts, and ADD to get the
result with arrays with the values we need: 1-9, 10-90,100-900, 1000-9000.

Thanks all for your suggestions and inspiration. It's a great place to be MASM FORUM.

:thumbsup:
Title: Re: ASM for FUN NEW step #1
Post by: frktons on November 10, 2021, 10:26:30 AM
Quote from: jj2007 on November 09, 2021, 10:13:07 PM
Quote from: frktons on November 09, 2021, 02:24:00 PMEAX contains the thousands (4) so it's value has to be multiplied for 1000
EBX contains the hundreds (7) so it's value has to be multiplied for 100
ECX contains the tens (3) so it's value  has to be multiplied for 10
EDX contains the units

include \masm32\include\masm32rt.inc

.data
x123 db "1234", 0

.code
start:
  mov ecx, dword ptr x123
  movzx eax, cl ; "1"
  and eax, 15 ; 1
  imul eax, eax, 1000 ; 1000
  movzx edx, ch ; "2"
  and edx, 15 ; 2
  imul edx, edx, 100 ; 200
  add eax, edx ; 1200
  bswap ecx
  movzx edx, ch ; "3"
  and edx, 15 ; 3
  imul edx, edx, 10 ; 30
  add eax, edx ; 1230
  and ecx, 15 ; "4" -> 4
  add eax, ecx ; 1234
  MsgBox 0, cat$(str$(eax), " is the value"), "That was simple:", MB_OK
  exit
end start


Hi JJ.

The solution you suggested is quite simple and elegant.
Pure 32 bit ASM code.
I suppose that the code:
AND EAX, 15 is the equivalent of SUB EAX, 48 (maybe faster?)

If we have an array with all the rounded thousands, hundreds, tens
and we use directly the ASCII code as an index to these arrays
and at the end we just add the four elements corresponding to
the four ASCII codes, could we have a faster result?

I think IMUL is slower than ADD, and we don't need the AND REG, 15.

What do you think my friend?

Enjoy

Title: Re: ASM for FUN NEW step #1
Post by: jj2007 on November 10, 2021, 12:59:38 PM
Hi mineiro & Frank,

Attached a version that you can use for your algos. TestD and TestE are free - just insert your code as demonstrated below the TestA_s: label.

No MasmBasic required, just the plain Masm32 SDK. Enjoy :thumbsup:
Title: Re: ASM for FUN NEW step #1
Post by: mineiro on November 10, 2021, 01:23:05 PM
Thanks sir jj2007;
Follow results, please note that code size of my procedure is wrong, is bigger than appears:

~/.wine/drive_c/FourCharsToDword2v1$ wine FourCharsToDword.exe
Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz (SSE4)

449 cycles for 100 * C2D_J
1499 cycles for 100 * atodw
12673 cycles for 100 * sscanf
342 cycles for 100 * mineiro
?? cycles for 100 * testE

442 cycles for 100 * C2D_J
1523 cycles for 100 * atodw
12670 cycles for 100 * sscanf
342 cycles for 100 * mineiro
?? cycles for 100 * testE

435 cycles for 100 * C2D_J
1514 cycles for 100 * atodw
12673 cycles for 100 * sscanf
344 cycles for 100 * mineiro
?? cycles for 100 * testE

435 cycles for 100 * C2D_J
1511 cycles for 100 * atodw
12891 cycles for 100 * sscanf
329 cycles for 100 * mineiro
?? cycles for 100 * testE

50 bytes for C2D_J
10 bytes for atodw
22 bytes for sscanf
46 bytes for mineiro
2 bytes for testE

1234 = eax C2D_J
1234 = eax atodw
1234 = eax sscanf
1234 = eax mineiro
2 = eax testE


Title: Re: ASM for FUN NEW step #1
Post by: frktons on November 10, 2021, 04:44:59 PM
Hi JJ, Mineiro.  :eusa_clap:

As supposed, with SSE/SSE2 and more advanced opcodes we can have great results
when we work with multiple registers in parallel operations.

It will take time for me to translate into ASM my ideas, but I presume the results could
be interesting.

Ad maiora.

:thumbsup:
Title: Re: ASM for FUN NEW step #1
Post by: jj2007 on November 10, 2021, 07:23:10 PM
Quote from: mineiro on November 10, 2021, 01:23:05 PM
Follow results, please note that code size of my procedure is wrong, is bigger than appears:

Very nice! Code size is calculated including the loop & call part but not including data, such as the 8 bytes of your total variable. On my old i5 your code is a tick slower than mine, but I see it's much faster on your i7 :thumbsup:

This should actually go into the proc, not before the loop, in order to be comparable to my algo; it does not influence speed, though:
movq xmm1,qword ptr [total] ;static values
pxor xmm2,xmm2


Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

495     cycles for 100 * C2D_J
1640    cycles for 100 * atodw
21981   cycles for 100 * sscanf
544     cycles for 100 * mineiro

469     cycles for 100 * C2D_J
1578    cycles for 100 * atodw
21924   cycles for 100 * sscanf
550     cycles for 100 * mineiro

471     cycles for 100 * C2D_J
1579    cycles for 100 * atodw
21940   cycles for 100 * sscanf
545     cycles for 100 * mineiro

469     cycles for 100 * C2D_J
1577    cycles for 100 * atodw
22206   cycles for 100 * sscanf
545     cycles for 100 * mineiro

50      bytes for C2D_J
10      bytes for atodw
22      bytes for sscanf
54      bytes for mineiro

1234    = eax C2D_J
1234    = eax atodw
1234    = eax sscanf
1234    = eax mineiro


P.S.: There is a useE=1 at the beginning of the file. Set useE=0 to suppress the unused TestE loop.
Title: Re: ASM for FUN NEW step #1
Post by: mineiro on November 10, 2021, 09:04:43 PM
Quote from: jj2007 on November 10, 2021, 07:23:10 PM
This should actually go into the proc, not before the loop, in order to be comparable to my algo; it does not influence speed, though:
movq xmm1,qword ptr [total] ;static values
pxor xmm2,xmm2

Inserting that code inside procedure I got a very close value as your code posted before.
I have tried to favor paralellization and inserted 2 values to be converted at one time call (eax and edx registers); sounds ok.

424 cycles for 100 * C2D_J
1504 cycles for 100 * atodw
12760 cycles for 100 * sscanf
652 cycles for 100 * mineiro   ;;converting 2 values simultaneously, all code inside procedure


I remember in past when we compete code that we avoid the use of mul/imul in favor of lea. I have tried that "first approach" posted in this topic but performance was not good in my pc (ever preserving register usage (ebx)), something like 700 ~ 720 cycles.

So, let's wait next round sir jj2007.
Title: Re: ASM for FUN NEW step #1
Post by: jj2007 on November 10, 2021, 10:40:38 PM
Quote from: mineiro on November 10, 2021, 09:04:43 PMI have tried to favor paralellization and inserted 2 values to be converted at one time call (eax and edx registers); sounds ok.

I'm not quite sure what your algo does. The other algos get a string pointer as argument:

push eax
invoke crt_sscanf, chr$("1234"), chr$("%d"), esp
pop eax
Title: Re: ASM for FUN NEW step #1
Post by: mineiro on November 10, 2021, 11:17:07 PM
Quote from: jj2007 on November 10, 2021, 10:40:38 PM
I'm not quite sure what your algo does. The other algos get a string pointer as argument:
Well, so we need define some rules.
Your algo does the same as mine, only reversed string being used.

mov eax, "4321"
call C2D_J


Follow algo that can perform 2 conversions (min2cvt), and that one that uses lea:
I will edit this message and post other computer results:

Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz (SSE4)

553 cycles for 100 * C2D_J
1487 cycles for 100 * atodw
12890 cycles for 100 * sscanf
739 cycles for 100 * min2cvt
333 cycles for 100 * min1cvt
895 cycles for 100 * min_lea

562 cycles for 100 * C2D_J
1485 cycles for 100 * atodw
12871 cycles for 100 * sscanf
743 cycles for 100 * min2cvt
333 cycles for 100 * min1cvt
895 cycles for 100 * min_lea

566 cycles for 100 * C2D_J
1486 cycles for 100 * atodw
12900 cycles for 100 * sscanf
746 cycles for 100 * min2cvt
332 cycles for 100 * min1cvt
751 cycles for 100 * min_lea

561 cycles for 100 * C2D_J
1559 cycles for 100 * atodw
12603 cycles for 100 * sscanf
742 cycles for 100 * min2cvt
332 cycles for 100 * min1cvt
895 cycles for 100 * min_lea

50 bytes for C2D_J
10 bytes for atodw
22 bytes for sscanf
77 bytes for min2cvt
44 bytes for min1cvt
92 bytes for min_lea

1234 = eax C2D_J
1234 = eax atodw
1234 = eax sscanf
1234 = eax min2cvt
1234 = eax min1cvt
1234 = eax min_lea




wine FourCharsToDword.exe
Intel(R) Core(TM) i3-3110M CPU @ 2.40GHz (SSE4)

561 cycles for 100 * C2D_J
1912 cycles for 100 * atodw
19520 cycles for 100 * sscanf
672 cycles for 100 * min2cvt
458 cycles for 100 * min1cvt
765 cycles for 100 * min_lea

568 cycles for 100 * C2D_J
1919 cycles for 100 * atodw
19520 cycles for 100 * sscanf
670 cycles for 100 * min2cvt
457 cycles for 100 * min1cvt
767 cycles for 100 * min_lea

561 cycles for 100 * C2D_J
1918 cycles for 100 * atodw
19488 cycles for 100 * sscanf
672 cycles for 100 * min2cvt
456 cycles for 100 * min1cvt
767 cycles for 100 * min_lea

571 cycles for 100 * C2D_J
1912 cycles for 100 * atodw
19498 cycles for 100 * sscanf
671 cycles for 100 * min2cvt
458 cycles for 100 * min1cvt
769 cycles for 100 * min_lea

50 bytes for C2D_J
10 bytes for atodw
22 bytes for sscanf
77 bytes for min2cvt
44 bytes for min1cvt
92 bytes for min_lea

1234 = eax C2D_J
1234 = eax atodw
1234 = eax sscanf
1234 = eax min2cvt
1234 = eax min1cvt
1234 = eax min_lea

Title: Re: ASM for FUN NEW step #1
Post by: mineiro on November 11, 2021, 12:32:30 AM
I changed my previous code to be string pointer as argument:
Previous cycles result don't changed.
I removed lea algo.
Title: Re: ASM for FUN NEW step #1
Post by: jj2007 on November 11, 2021, 01:34:32 AM
Quote from: mineiro on November 11, 2021, 12:32:30 AM
I changed my previous code to be string pointer as argument

I'm afraid you are not passing a string pointer:
mov eax, "4321" ;first value to be converted
mov edx, "1234" ;second value to be converted


mov eax, chr$("1234") would pass a string pointer, as in
  .Repeat
push eax
invoke crt_sscanf, chr$("1234"), chr$("%d"), esp
pop eax
dec ebx
  .Until Sign?
Title: Re: ASM for FUN NEW step #1
Post by: mineiro on November 11, 2021, 07:08:29 AM
Quote from: jj2007 on November 11, 2021, 01:34:32 AM
I'm afraid you are not passing a string pointer:
mov eax, "4321" ;first value to be converted
mov edx, "1234" ;second value to be converted

mov eax, chr$("1234") would pass a string pointer, as in

My and your procedure are passing data direct loaded into eax register while other functions (atodw/crt_sscanf) are passing data pointer by stack.
RULES:
All procedures should pass a pointer to data by using stack, stdcall.
I changed C2D_J,min2cvt and min1cvt.


wine FourCharsToDword.exe
Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz (SSE4)

557 cycles for 100 * C2D_J
1524 cycles for 100 * atodw
12665 cycles for 100 * sscanf
767 cycles for 100 * min2cvt
429 cycles for 100 * min1cvt

579 cycles for 100 * C2D_J
1499 cycles for 100 * atodw
12649 cycles for 100 * sscanf
785 cycles for 100 * min2cvt
429 cycles for 100 * min1cvt

602 cycles for 100 * C2D_J
1496 cycles for 100 * atodw
12631 cycles for 100 * sscanf
783 cycles for 100 * min2cvt
429 cycles for 100 * min1cvt

591 cycles for 100 * C2D_J
1497 cycles for 100 * atodw
12631 cycles for 100 * sscanf
747 cycles for 100 * min2cvt
431 cycles for 100 * min1cvt

54 bytes for C2D_J
10 bytes for atodw
22 bytes for sscanf
84 bytes for min2cvt
52 bytes for min1cvt

1234 = eax C2D_J
1234 = eax atodw
1234 = eax sscanf
1234 = eax min2cvt
1234 = eax min1cvt

Title: Re: ASM for FUN NEW step #1
Post by: jj2007 on November 11, 2021, 11:38:43 AM
Excellent, mineiro :thumbsup:

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

467     cycles for 100 * C2D_J
1592    cycles for 100 * atodw
21900   cycles for 100 * sscanf
761     cycles for 100 * min2cvt
543     cycles for 100 * min1cvt

467     cycles for 100 * C2D_J
1588    cycles for 100 * atodw
21908   cycles for 100 * sscanf
763     cycles for 100 * min2cvt
543     cycles for 100 * min1cvt

467     cycles for 100 * C2D_J
1600    cycles for 100 * atodw
21898   cycles for 100 * sscanf
768     cycles for 100 * min2cvt
549     cycles for 100 * min1cvt

468     cycles for 100 * C2D_J
1591    cycles for 100 * atodw
21924   cycles for 100 * sscanf
767     cycles for 100 * min2cvt
545     cycles for 100 * min1cvt

50      bytes for C2D_J
10      bytes for atodw
22      bytes for sscanf
98      bytes for min2cvt
62      bytes for min1cvt

1234    = eax C2D_J
1234    = eax atodw
1234    = eax sscanf
1234    = eax min2cvt
1234    = eax min1cvt
Title: Re: ASM for FUN NEW step #1
Post by: raymond on November 11, 2021, 03:18:06 PM
I would think that using Look-up tables for converting numbers from ascii format to binary may be the fastest option. I don't have the facilities to compare various procedures but someone else may want to test it.
I wrote some code to prepare such an LUT (once which should take only a few nanosecs) then used it to convert an ascii number 1,000,000,000 times (10^9). The timing on my older laptop was 1.5 secs!

Code for preparing LUT
.data
      t10   dd    10 DUP(0)
      t100  dd    10 DUP(0)
      t1000 dd    10 DUP(0)

      numbers     db    "3578"

start:

;fill LUTs of multiples
;of 10
      lea   edi,t10
      xor   eax,eax
      mov   edx,10
      mov   ecx,10
   @@:
      stosd
      add   eax,edx
      dec   ecx
      jnz   @B

;of 100
      xor   eax,eax
      mov   edx,100
      mov   ecx,10
   @@:
      stosd
      add   eax,edx
      dec   ecx
      jnz   @B

;of 1000
      xor   eax,eax
      mov   edx,1000
      mov   ecx,10
   @@:
      stosd
      add   eax,edx
      dec   ecx
      jnz   @B

;there is NO need to have a look-up table for multiples of units!!!


Then I ran the following code for converting ascii to binary, using the QueryPerformanceFrequency function for the timing

      mov   ecx,1000000000
start1:

      lea   esi,numbers
      mov   edx,[esi]
      and   edx,0f0f0f0fh
      movzx ebx,dl            ;1000s
      mov   eax,t1000[ebx*4]
      movzx ebx,dh            ;100s
      add   eax,t100[ebx*4]
      bswap edx
      movzx ebx,dh            ;10s
      add   eax,t10[ebx*4]
      movzx ebx,dl            ;units
      add   eax,ebx

      dec   ecx
      jnz   start1
Title: Re: ASM for FUN NEW step #1
Post by: TimoVJL on November 11, 2021, 06:13:44 PM
AMD Athlon(tm) II X2 220 Processor (SSE3)

833     cycles for 100 * C2D_J
2410    cycles for 100 * atodw
34220   cycles for 100 * sscanf
1296    cycles for 100 * min2cvt
1057    cycles for 100 * min1cvt

806     cycles for 100 * C2D_J
2437    cycles for 100 * atodw
34498   cycles for 100 * sscanf
1297    cycles for 100 * min2cvt
1053    cycles for 100 * min1cvt

805     cycles for 100 * C2D_J
2436    cycles for 100 * atodw
34558   cycles for 100 * sscanf
1297    cycles for 100 * min2cvt
1055    cycles for 100 * min1cvt

805     cycles for 100 * C2D_J
2410    cycles for 100 * atodw
34503   cycles for 100 * sscanf
1296    cycles for 100 * min2cvt
1060    cycles for 100 * min1cvt

50      bytes for C2D_J
10      bytes for atodw
22      bytes for sscanf
98      bytes for min2cvt
62      bytes for min1cvt

1234    = eax C2D_J
1234    = eax atodw
1234    = eax sscanf
1234    = eax min2cvt
1234    = eax min1cvt

--- ok ---
LUT with Pelles C1: #include <stdio.h>
2: int a10[] = {0,10,20,30,40,50,60,70,80,90};
3: int a100[] = {0,100,200,300,400,500,600,700,800,900};
4: int a1000[] = {0,1000,2000,3000,4000,5000,6000,7000,8000,9000};
5:
6: int __cdecl main(void)

_main:
  [00000000] 55                     push              ebp
  [00000001] 89E5                   mov               ebp,esp
  [00000003] 83EC08                 sub               esp,8
7: {
8: char anum[] = "1234";
  [00000006] C745FB31323334         mov               dword ptr [ebp-5],34333231
  [0000000D] C645FF00               mov               byte ptr [ebp-1],0
9: int num;
10:
11: (*(int*)&anum) &= 0x0f0f0f0f;
  [00000011] 8165FB0F0F0F0F         and               dword ptr [ebp-5],F0F0F0F
12: num = a1000[anum[0]] + a100[anum[1]] + a10[anum[2]] + anum[3];
  [00000018] 0FBE45FB               movsx             eax,byte ptr [ebp-5]
  [0000001C] 8B048500000000         mov               eax,dword ptr [eax*4+_a1000]
  [00000023] 0FBE55FC               movsx             edx,byte ptr [ebp-4]
  [00000027] 03049500000000         add               eax,dword ptr [edx*4+_a100]
  [0000002E] 0FBE55FD               movsx             edx,byte ptr [ebp-3]
  [00000032] 03049500000000         add               eax,dword ptr [edx*4+_a10]
  [00000039] 0FBE55FE               movsx             edx,byte ptr [ebp-2]
  [0000003D] 01D0                   add               eax,edx
13: printf("%u\n", num);
  [0000003F] 50                     push              eax
  [00000040] 6800000000             push              @6
  [00000045] E800000000             call              _printf
  [0000004A] 83C408                 add               esp,8
14: return 0;
  [0000004D] 31C0                   xor               eax,eax
  [0000004F] 89EC                   mov               esp,ebp
  [00000051] 5D                     pop               ebp
  [00000052] C3                     ret               
using temporary string
char anum2[] = "1234";
char anum[5];
int num;

12: (*(int*)&anum) = (*(int*)&anum2) & 0x0f0f0f0f;
  [00000011] 8B45FB                 mov               eax,dword ptr [ebp-5]
  [00000014] 250F0F0F0F             and               eax,F0F0F0F
  [00000019] 8945F6                 mov               dword ptr [ebp-A],eax
Title: Re: ASM for FUN NEW step #1
Post by: jj2007 on November 11, 2021, 08:28:03 PM
Quote from: raymond on November 11, 2021, 03:18:06 PM
I would think that using Look-up tables for converting numbers from ascii format to binary may be the fastest option. I don't have the facilities to compare various procedures but someone else may want to test it.

Thank you, Raymond. Here it is as ConvertLUT:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

475     cycles for 100 * C2D_J
1597    cycles for 100 * atodw
22224   cycles for 100 * sscanf
772     cycles for 100 * min2cvt
555     cycles for 100 * min1cvt
640     cycles for 100 * ConvertLUT

497     cycles for 100 * C2D_J
1595    cycles for 100 * atodw
22401   cycles for 100 * sscanf
800     cycles for 100 * min2cvt
584     cycles for 100 * min1cvt
631     cycles for 100 * ConvertLUT

489     cycles for 100 * C2D_J
1598    cycles for 100 * atodw
22796   cycles for 100 * sscanf
776     cycles for 100 * min2cvt
551     cycles for 100 * min1cvt
640     cycles for 100 * ConvertLUT

50      bytes for C2D_J
10      bytes for atodw
22      bytes for sscanf
98      bytes for min2cvt
62      bytes for min1cvt
118     bytes for ConvertLUT

1234    = eax C2D_J
1234    = eax atodw
1234    = eax sscanf
1234    = eax min2cvt
1234    = eax min1cvt
1234    = eax ConvertLUT
Title: Re: ASM for FUN NEW step #1
Post by: mineiro on November 11, 2021, 09:44:28 PM
Good job sir raymond and sir TimoVJL;
I created a static aligned LUT table to your procedures in data section.
I passed data pointer by stack, so, inserted a few modifications to deal with this in both codes.

raymond; when I changed bswap in your code by shr and respective changes it performed a bit better in my machine.
I'm leaving original code in testcase, only changed ebx to ecx register usage to follow register preservations without need of push/pop ebx.

TimoVJL; your code suffered performance because value being passed by stack instead of data pointer being passed by stack. I know this can be done inside our code, but just to follow rules. I have tried to insert same overhead in all procedures so result should be honest.
After "and 0f0f0f0fh" I inserted value back to stack. So, I confess it's not the best try because data is in register and go back to stack and read stack data digits each time. If you know better modifications please tell.

I have tried to not touch in yours code, only adapt to windows rules.

sir jj2007; in your testcase some procedures are getting data pointer from register while other procedures are receiving that from stack. This is creating different overheads. In first case it's necessary "mov reg,[reg]" while in other "mov reg,[esp+4] mov reg,[reg]", and at end of procedure first procedure do only "retn" while other "ret 4". Suggestions?


wine FourCharsToDword.exe
Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz (SSE4)

562 cycles for 100 * C2D_J
1482 cycles for 100 * atodw
12649 cycles for 100 * sscanf
752 cycles for 100 * min2cvt
477 cycles for 100 * min1cvt
471 cycles for 100 * ray_LUT
741 cycles for 100 * TimoVJL_PellesC_LUT

638 cycles for 100 * C2D_J
1480 cycles for 100 * atodw
12835 cycles for 100 * sscanf
756 cycles for 100 * min2cvt
433 cycles for 100 * min1cvt
476 cycles for 100 * ray_LUT
744 cycles for 100 * TimoVJL_PellesC_LUT

564 cycles for 100 * C2D_J
1505 cycles for 100 * atodw
12630 cycles for 100 * sscanf
778 cycles for 100 * min2cvt
427 cycles for 100 * min1cvt
470 cycles for 100 * ray_LUT
729 cycles for 100 * TimoVJL_PellesC_LUT

561 cycles for 100 * C2D_J
1478 cycles for 100 * atodw
12558 cycles for 100 * sscanf
751 cycles for 100 * min2cvt
432 cycles for 100 * min1cvt
465 cycles for 100 * ray_LUT
730 cycles for 100 * TimoVJL_PellesC_LUT

54 bytes for C2D_J
10 bytes for atodw
22 bytes for sscanf
84 bytes for min2cvt
52 bytes for min1cvt
62 bytes for ray_LUT
70 bytes for TimoVJL_PellesC_LUT

1234 = eax C2D_J
1234 = eax atodw
1234 = eax sscanf
1234 = eax min2cvt
1234 = eax min1cvt
1234 = eax ray_LUT
1234 = eax TimoVJL_PellesC_LUT


PS: sir jj2007 told me in pm that my executables are not running from GUI, so, you must open console to run; or if someone can create an executable direct using windows just post that. I'm doing these thing using linux. Appreciate.
Title: Re: ASM for FUN NEW step #1
Post by: mineiro on November 11, 2021, 09:57:43 PM
These are results from jj2007 testcase:

wine FourCharsToDword.exe
Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz (SSE4)

479 cycles for 100 * C2D_J
1528 cycles for 100 * atodw
12745 cycles for 100 * sscanf
725 cycles for 100 * min2cvt
493 cycles for 100 * min1cvt
502 cycles for 100 * ConvertLUT

478 cycles for 100 * C2D_J
1610 cycles for 100 * atodw
12623 cycles for 100 * sscanf
745 cycles for 100 * min2cvt
495 cycles for 100 * min1cvt
501 cycles for 100 * ConvertLUT

550 cycles for 100 * C2D_J
1483 cycles for 100 * atodw
12590 cycles for 100 * sscanf
741 cycles for 100 * min2cvt
581 cycles for 100 * min1cvt
500 cycles for 100 * ConvertLUT

50 bytes for C2D_J
10 bytes for atodw
22 bytes for sscanf
98 bytes for min2cvt
62 bytes for min1cvt
118 bytes for ConvertLUT

1234 = eax C2D_J
1234 = eax atodw
1234 = eax sscanf
1234 = eax min2cvt
1234 = eax min1cvt
1234 = eax ConvertLUT

Title: Re: ASM for FUN NEW step #1
Post by: mineiro on November 11, 2021, 11:34:15 PM
Thanks for the tip sir jj2007;
I checked command line when building executable and subsystem:WINDOWS was default; I changed that to CONSOLE.
Link version that I was using is v8.???, changed that to Version 5.12.8078.
So, I suppose now can work by double clicking.
Title: Re: ASM for FUN NEW step #1
Post by: jj2007 on November 12, 2021, 01:03:01 AM
Thanks, mineiro - it works, and now we have a clear winner, at least on my cpu:

min1cvt proc _data:dword
option prologue:none
option epilogue:none

mov ecx,[esp+4]
mov eax,[ecx]
xor eax,"0000"      ;and eax,0f0f0f0fh
movd xmm0,eax       ;01020304h
punpcklbw xmm0,xmm2 ;0001000200030004h
pmaddwd xmm0,xmm1   ;mul and add
;---------------------------
;db 66h,0fh,38h,02h,0c0h ;          phaddd xmm0,xmm0 ;SSE3
;movd eax,xmm0

movd eax,xmm0
psrlq xmm0,32
movd ecx,xmm0
add eax,ecx
ret 4
min1cvt endp


Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

655     cycles for 100 * C2D_J
1605    cycles for 100 * atodw
22178   cycles for 100 * sscanf
785     cycles for 100 * min2cvt
475     cycles for 100 * min1cvt
599     cycles for 100 * ray_LUT
661     cycles for 100 * TimoVJL_PellesC_LUT

651     cycles for 100 * C2D_J
1599    cycles for 100 * atodw
22808   cycles for 100 * sscanf
823     cycles for 100 * min2cvt
508     cycles for 100 * min1cvt
607     cycles for 100 * ray_LUT
717     cycles for 100 * TimoVJL_PellesC_LUT

714     cycles for 100 * C2D_J
1615    cycles for 100 * atodw
22657   cycles for 100 * sscanf
778     cycles for 100 * min2cvt
488     cycles for 100 * min1cvt
587     cycles for 100 * ray_LUT
657     cycles for 100 * TimoVJL_PellesC_LUT

696     cycles for 100 * C2D_J
1613    cycles for 100 * atodw
22193   cycles for 100 * sscanf
778     cycles for 100 * min2cvt
487     cycles for 100 * min1cvt
595     cycles for 100 * ray_LUT
664     cycles for 100 * TimoVJL_PellesC_LUT

54      bytes for C2D_J
10      bytes for atodw
22      bytes for sscanf
84      bytes for min2cvt
52      bytes for min1cvt
62      bytes for ray_LUT
70      bytes for TimoVJL_PellesC_LUT

1234    = eax C2D_J
1234    = eax atodw
1234    = eax sscanf
1234    = eax min2cvt
1234    = eax min1cvt
1234    = eax ray_LUT
1234    = eax TimoVJL_PellesC_LUT


It's a pity that this is limited to the exotic test case "four digits" :sad:
Title: Re: ASM for FUN NEW step #1
Post by: daydreamer on November 12, 2021, 02:00:42 AM
Quote from: jj2007 on November 12, 2021, 01:03:01 AM
It's a pity that this is limited to the exotic test case "four digits" :sad:
its not,thanks everyone for testing different approaches to fast ascii->integer :thumbsup:
there is still fastest ascii->double and fastest ascii->64bit integer
Title: Re: ASM for FUN NEW step #1
Post by: mineiro on November 12, 2021, 06:13:25 AM
Quote from: jj2007 on November 12, 2021, 01:03:01 AM
Thanks, mineiro - it works, and now we have a clear winner, at least on my cpu:
Not really, I have absolute sure that this code can be defeat by others, and will perform different in others cpus. This is what I learn in this board.

I forgot how to activate xmm, do you remember? That code is being translated to mmx.
I'm adding a "66h byte" as preffix in instructions in other tests to work with xmm, but I remember that exist other way to do this. Maybe macros!?
db 66h
movd xmm0,eax       ;01020304h
db 66h
punpcklbw xmm0,xmm2 ;0001000200030004h
...
Title: Re: ASM for FUN NEW step #1
Post by: daydreamer on November 12, 2021, 06:54:11 AM
Quote from: mineiro on November 12, 2021, 06:13:25 AM
Quote from: jj2007 on November 12, 2021, 01:03:01 AM
Thanks, mineiro - it works, and now we have a clear winner, at least on my cpu:
Not really, I have absolute sure that this code can be defeat by others, and will perform different in others cpus. This is what I learn in this board.

I forgot how to activate xmm, do you remember? That code is being translated to mmx.
I'm adding a "66h byte" as preffix in instructions in other tests to work with xmm, but I remember that exist other way to do this. Maybe macros!?
db 66h
movd xmm0,eax       ;01020304h
db 66h
punpcklbw xmm0,xmm2 ;0001000200030004h
...
change to newer assembler than ml 6.14 solves it without need for macros

Title: Re: ASM for FUN NEW step #1
Post by: mineiro on November 12, 2021, 08:08:40 AM
Quote from: daydreamer on November 12, 2021, 06:54:11 AM
change to newer assembler than ml 6.14 solves it without need for macros
Thanks sir daydreamer.
Title: Re: ASM for FUN NEW step #1
Post by: frktons on November 12, 2021, 09:28:27 AM
That's great Guys.

I am happy to see many options are coming out and the speed of the code is quite nice.  :thumbsup:

Compliments to everyone partecipating in the game.  :eusa_clap:
Title: Re: ASM for FUN NEW step #1
Post by: hutch-- on November 12, 2021, 11:03:09 AM
Here is another approach, a finite state machine, AKA look up tree. 4 character strings fed in, integers fed out. Its big but its reasonably fast. Timings below are with my old i7.

Output

2
9999
5432
8619
9876
2345
1234
7776
8193
4352
test return values - done

timing 10 million matches next, press any key ....

ms = 109 duration
Press any key to continue ...
Title: Re: ASM for FUN NEW step #1
Post by: jj2007 on November 12, 2021, 11:45:23 AM
Finite state machine, AKA look up tree vs my algo. I hope I didn't make any mistakes in translation. The attachment does not contain the num.asm file - see Hutch' post.
Title: Re: ASM for FUN NEW step #1
Post by: hutch-- on November 12, 2021, 12:31:28 PM
 :thumbsup:

That is genuinely fast and it looks like its producing the right output numbers.
Title: Re: ASM for FUN NEW step #1
Post by: jj2007 on November 12, 2021, 01:01:42 PM
I managed to integrate Hutch' finite state machine into the testbed. To assemble it, the num.asm file must be in the same folder. It's still plain Masm32 SDK, MasmBasic is not required. The min1cvt algo is a clear winner :thumbsup:

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

665     cycles for 100 * C2D_J
1600    cycles for 100 * atodw
21983   cycles for 100 * sscanf
766     cycles for 100 * min2cvt
545     cycles for 100 * min1cvt
683     cycles for 100 * ConvertLUT
2292    cycles for 100 * FSM (Hutch)

690     cycles for 100 * C2D_J
1605    cycles for 100 * atodw
21970   cycles for 100 * sscanf
769     cycles for 100 * min2cvt
544     cycles for 100 * min1cvt
653     cycles for 100 * ConvertLUT
2326    cycles for 100 * FSM (Hutch)

660     cycles for 100 * C2D_J
1612    cycles for 100 * atodw
22241   cycles for 100 * sscanf
770     cycles for 100 * min2cvt
547     cycles for 100 * min1cvt
626     cycles for 100 * ConvertLUT
2263    cycles for 100 * FSM (Hutch)

54      bytes for C2D_J
10      bytes for atodw
22      bytes for sscanf
98      bytes for min2cvt
62      bytes for min1cvt
118     bytes for ConvertLUT
8       bytes for FSM (Hutch)

1234    = eax C2D_J
1234    = eax atodw
1234    = eax sscanf
1234    = eax min2cvt
1234    = eax min1cvt
1234    = eax ConvertLUT
1234    = eax FSM (Hutch)
Title: Re: ASM for FUN NEW step #1
Post by: hutch-- on November 12, 2021, 05:52:59 PM
The attached zip file contains the source for the FSM, it is every number between 0000 and 9999 and will test all of them if you make the function call.

It occurs in this form.

0000 0000
0001 0001
0002 0002
0003 0003
0004 0004
............
9995 9995
9996 9996
9997 9997
9998 9998
9999 9999


Pass the first string to the integers procedure and it will return the actual integer in EAX.
Title: Re: ASM for FUN NEW step #1
Post by: TimoVJL on November 12, 2021, 06:40:44 PM
AMD Athlon(tm) II X2 220 Processor (SSE3)

1186    cycles for 100 * C2D_J
2428    cycles for 100 * atodw
35137   cycles for 100 * sscanf
1304    cycles for 100 * min2cvt
1059    cycles for 100 * min1cvt
712     cycles for 100 * ConvertLUT
2780    cycles for 100 * FSM (Hutch)

1189    cycles for 100 * C2D_J
2426    cycles for 100 * atodw
34990   cycles for 100 * sscanf
1305    cycles for 100 * min2cvt
1065    cycles for 100 * min1cvt
721     cycles for 100 * ConvertLUT
2738    cycles for 100 * FSM (Hutch)

1185    cycles for 100 * C2D_J
2427    cycles for 100 * atodw
34735   cycles for 100 * sscanf
1311    cycles for 100 * min2cvt
1058    cycles for 100 * min1cvt
710     cycles for 100 * ConvertLUT
3155    cycles for 100 * FSM (Hutch)

54      bytes for C2D_J
10      bytes for atodw
22      bytes for sscanf
98      bytes for min2cvt
62      bytes for min1cvt
118     bytes for ConvertLUT
8       bytes for FSM (Hutch)

1234    = eax C2D_J
1234    = eax atodw
1234    = eax sscanf
1234    = eax min2cvt
1234    = eax min1cvt
1234    = eax ConvertLUT
1234    = eax FSM (Hutch)

--- ok ---
AMD Ryzen 5 3400G with Radeon Vega Graphics     (SSE4)

532     cycles for 100 * C2D_J
1401    cycles for 100 * atodw
25676   cycles for 100 * sscanf
523     cycles for 100 * min2cvt
435     cycles for 100 * min1cvt
615     cycles for 100 * ConvertLUT
2124    cycles for 100 * FSM (Hutch)

524     cycles for 100 * C2D_J
1371    cycles for 100 * atodw
26048   cycles for 100 * sscanf
524     cycles for 100 * min2cvt
431     cycles for 100 * min1cvt
618     cycles for 100 * ConvertLUT
2129    cycles for 100 * FSM (Hutch)

512     cycles for 100 * C2D_J
1383    cycles for 100 * atodw
25954   cycles for 100 * sscanf
517     cycles for 100 * min2cvt
434     cycles for 100 * min1cvt
615     cycles for 100 * ConvertLUT
2127    cycles for 100 * FSM (Hutch)

54      bytes for C2D_J
10      bytes for atodw
22      bytes for sscanf
98      bytes for min2cvt
62      bytes for min1cvt
118     bytes for ConvertLUT
8       bytes for FSM (Hutch)

1234    = eax C2D_J
1234    = eax atodw
1234    = eax sscanf
1234    = eax min2cvt
1234    = eax min1cvt
1234    = eax ConvertLUT
1234    = eax FSM (Hutch)

-
Title: Re: ASM for FUN NEW step #1
Post by: TimoVJL on November 13, 2021, 11:45:28 PM
Example how msvc C optimizer works:
int str4int( char *s)
{
char anum[5];
(*(int*)&anum) = (*(int*)s) & 0x0f0f0f0f;
return a1000[anum[0]] + a100[anum[1]] + a10[anum[2]] + anum[3];
}

_str4int:
  [00000000] 8B442404               mov               eax,dword ptr [esp+4]
  [00000004] 53                     push              ebx
  [00000005] 8B18                   mov               ebx,dword ptr [eax]
  [00000007] 81E30F0F0F0F           and               ebx,F0F0F0Fh
  [0000000D] 8BC3                   mov               eax,ebx
  [0000000F] C1E810                 shr               eax,10h
  [00000012] 0FB6D0                 movzx             edx,al
  [00000015] 8BC3                   mov               eax,ebx
  [00000017] C1E808                 shr               eax,8
  [0000001A] 0FB6C8                 movzx             ecx,al
  [0000001D] 8B049500000000         mov               eax,dword ptr [edx*4+_a10]
  [00000024] 03048D00000000         add               eax,dword ptr [ecx*4+_a100]
  [0000002B] 0FB6CB                 movzx             ecx,bl
  [0000002E] C1EB18                 shr               ebx,18h
  [00000031] 03048D00000000         add               eax,dword ptr [ecx*4+_a1000]
  [00000038] 03C3                   add               eax,ebx
  [0000003A] 5B                     pop               ebx
  [0000003B] C3                     ret
Title: Re: ASM for FUN NEW step #1
Post by: jj2007 on November 14, 2021, 12:50:56 AM
New version, my algo and mineiro's are somewhat faster now:

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

554     cycles for 100 * C2D_J
1594    cycles for 100 * atodw
776     cycles for 100 * min2cvt
476     cycles for 100 * min1cvt (SSE3)
625     cycles for 100 * ConvertLUT
2273    cycles for 100 * FSM (Hutch)

545     cycles for 100 * C2D_J
1575    cycles for 100 * atodw
768     cycles for 100 * min2cvt
478     cycles for 100 * min1cvt (SSE3)
625     cycles for 100 * ConvertLUT
2252    cycles for 100 * FSM (Hutch)

548     cycles for 100 * C2D_J
1594    cycles for 100 * atodw
765     cycles for 100 * min2cvt
478     cycles for 100 * min1cvt (SSE3)
637     cycles for 100 * ConvertLUT
2262    cycles for 100 * FSM (Hutch)

54      bytes for C2D_J
10      bytes for atodw
98      bytes for min2cvt
50      bytes for min1cvt (SSE3)
118     bytes for ConvertLUT
8       bytes for FSM (Hutch)

1234    = eax C2D_J
1234    = eax atodw
1234    = eax min2cvt
1234    = eax min1cvt (SSE3)
1234    = eax ConvertLUT
1234    = eax FSM (Hutch)


@Timo: I've tried the MSVC optimised version, but it's 16% slower and gives a wrong result :sad:
See below C2D_J: (or search for msvc inside the file)
Title: Re: ASM for FUN NEW step #1
Post by: mineiro on November 14, 2021, 01:23:17 AM
In my machine, "shr reg,16" performs better than "bswap", and respective changes after this modification. The final gain in cycles is about 010~013 cycles.
ConvertLUT can be optimized a bit more, by removing "push/pop ebx" and using ecx register instead.

These are results of last benchmark test:
Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz (SSE4)

513 cycles for 100 * C2D_J
1475 cycles for 100 * atodw
687 cycles for 100 * min2cvt
463 cycles for 100 * min1cvt (SSE3)
640 cycles for 100 * ConvertLUT
1383 cycles for 100 * FSM (Hutch)

467 cycles for 100 * C2D_J
1477 cycles for 100 * atodw
696 cycles for 100 * min2cvt
456 cycles for 100 * min1cvt (SSE3)
639 cycles for 100 * ConvertLUT
1383 cycles for 100 * FSM (Hutch)

456 cycles for 100 * C2D_J
1474 cycles for 100 * atodw
702 cycles for 100 * min2cvt
503 cycles for 100 * min1cvt (SSE3)
494 cycles for 100 * ConvertLUT
1431 cycles for 100 * FSM (Hutch)

54 bytes for C2D_J
10 bytes for atodw
98 bytes for min2cvt
50 bytes for min1cvt (SSE3)
118 bytes for ConvertLUT
8 bytes for FSM (Hutch)

1234 = eax C2D_J
1234 = eax atodw
1234 = eax min2cvt
1234 = eax min1cvt (SSE3)
1234 = eax ConvertLUT
1234 = eax FSM (Hutch)
Title: Re: ASM for FUN NEW step #1
Post by: daydreamer on November 14, 2021, 01:56:33 AM
Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz (SSE4)

453     cycles for 100 * C2D_J
1476    cycles for 100 * atodw
745     cycles for 100 * min2cvt
483     cycles for 100 * min1cvt (SSE3)
477     cycles for 100 * ConvertLUT
1397    cycles for 100 * FSM (Hutch)

457     cycles for 100 * C2D_J
1458    cycles for 100 * atodw
695     cycles for 100 * min2cvt
461     cycles for 100 * min1cvt (SSE3)
470     cycles for 100 * ConvertLUT
1366    cycles for 100 * FSM (Hutch)

450     cycles for 100 * C2D_J
1456    cycles for 100 * atodw
692     cycles for 100 * min2cvt
471     cycles for 100 * min1cvt (SSE3)
473     cycles for 100 * ConvertLUT
1374    cycles for 100 * FSM (Hutch)

54      bytes for C2D_J
10      bytes for atodw
98      bytes for min2cvt
50      bytes for min1cvt (SSE3)
118     bytes for ConvertLUT
8       bytes for FSM (Hutch)

1234    = eax C2D_J
1234    = eax atodw
1234    = eax min2cvt
1234    = eax min1cvt (SSE3)
1234    = eax ConvertLUT
1234    = eax FSM (Hutch)

-

@minerio about shr reg,16 vs bswap
with your own SSE version you should try if byte shuffle is faster?
my cpu is haswell based,so it supports avx2,anyone more than hutch that can test avx2 code?
Title: Re: ASM for FUN NEW step #1
Post by: TimoVJL on November 14, 2021, 08:15:40 AM
A test file with msvc 2019, it output 1234 as expexted.
Just for to see, how C works with LUT.
Title: Re: ASM for FUN NEW step #1
Post by: jj2007 on November 14, 2021, 11:13:07 AM
Quote from: TimoVJL on November 14, 2021, 08:15:40 AM
A test file with msvc 2019, it output 1234 as expexted.
Just for to see, how C works with LUT.

For me it didn't work, the result is 1114 instead of 1234. Please search for msvc inside FourCharsToDword.asm, it should be line 74; change line 56 to if 0 to test the msvc code. Maybe I made an error when copying & pasting your code :sad:
Title: Re: ASM for FUN NEW step #1
Post by: hutch-- on November 14, 2021, 12:43:24 PM
To appeal to your sense of humour, I split the FSM into 4 procedures, much smaller file, got it to produce the right numbers but it was about 30% slower than the single FSM procedure.  :undecided:
Title: Re: ASM for FUN NEW step #1
Post by: jj2007 on November 14, 2021, 11:27:52 PM
Ok, here is version 5, with Timo's MSVC "optimised" algo. On my machine, mineiro's SSE3 algo is a clear winner:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

545     cycles for 100 * C2D_J
1587    cycles for 100 * atodw
769     cycles for 100 * min2cvt
503     cycles for 100 * min1cvt (SSE3)
626     cycles for 100 * ConvertLUT
2267    cycles for 100 * FSM (Hutch)
698     cycles for 100 * MSVC (Timo)

550     cycles for 100 * C2D_J
1593    cycles for 100 * atodw
768     cycles for 100 * min2cvt
477     cycles for 100 * min1cvt (SSE3)
624     cycles for 100 * ConvertLUT
2244    cycles for 100 * FSM (Hutch)
698     cycles for 100 * MSVC (Timo)

547     cycles for 100 * C2D_J
1589    cycles for 100 * atodw
768     cycles for 100 * min2cvt
483     cycles for 100 * min1cvt (SSE3)
624     cycles for 100 * ConvertLUT
2246    cycles for 100 * FSM (Hutch)
698     cycles for 100 * MSVC (Timo)

54      bytes for C2D_J
10      bytes for atodw
98      bytes for min2cvt
50      bytes for min1cvt (SSE3)
118     bytes for ConvertLUT
8       bytes for FSM (Hutch)
70      bytes for MSVC (Timo)

1234    = eax C2D_J
1234    = eax atodw
1234    = eax min2cvt
1234    = eax min1cvt (SSE3)
1234    = eax ConvertLUT
1234    = eax FSM (Hutch)
1234    = eax MSVC (Timo)
Title: Re: ASM for FUN NEW step #1
Post by: daydreamer on November 15, 2021, 12:23:28 AM
Minieros also my favourite
, but for ascii to double conversion, I think Raymond approach would be more versatile take care of "1234.5678", but also "123.45678" and "12345.678"
Title: Re: ASM for FUN NEW step #1
Post by: jj2007 on November 15, 2021, 12:46:38 AM
Quote from: daydreamer on November 15, 2021, 12:23:28 AMfor ascii to double conversion, I think Raymond approach would be more versatile take care of "1234.5678", but also "123.45678" and "12345.678"

Really? How exactly would you do that?

ConvertLUT:
  mov edx, [eax] ; pointer to string, e.g. "1234"
  and edx, 0f0f0f0fh ; convert "1" to 1
  movzx ecx, dl ;1000s
  mov eax, t1000[ecx*4]
  movzx ecx,dh ;100s
  add eax, t100[ecx*4]
  bswap edx
  movzx ecx,dh ;10s
  add eax, t10[ecx*4]
  movzx ecx, dl ;units
  add eax, ecx
  retn
Title: Re: ASM for FUN NEW step #1
Post by: TimoVJL on November 15, 2021, 02:15:03 AM
Sadly test program don't work with AMD Athlon(tm) II X2 220 Processor (SSE3)
AMD Athlon(tm) II X2 220 Processor (SSE3)

909     cycles for 100 * C2D_J
2480    cycles for 100 * atodw
1301    cycles for 100 * min2cvt
Title: Re: ASM for FUN NEW step #1
Post by: FORTRANS on November 15, 2021, 03:06:24 AM
Intel(R) Core(TM) i3-4005U CPU @ 1.70GHz (SSE4)

556 cycles for 100 * C2D_J
1699 cycles for 100 * atodw
847 cycles for 100 * min2cvt
553 cycles for 100 * min1cvt (SSE3)
588 cycles for 100 * ConvertLUT
1597 cycles for 100 * FSM (Hutch)

552 cycles for 100 * C2D_J
1701 cycles for 100 * atodw
847 cycles for 100 * min2cvt
557 cycles for 100 * min1cvt (SSE3)
586 cycles for 100 * ConvertLUT
1596 cycles for 100 * FSM (Hutch)

554 cycles for 100 * C2D_J
1701 cycles for 100 * atodw
846 cycles for 100 * min2cvt
565 cycles for 100 * min1cvt (SSE3)
588 cycles for 100 * ConvertLUT
1587 cycles for 100 * FSM (Hutch)

54 bytes for C2D_J
10 bytes for atodw
98 bytes for min2cvt
50 bytes for min1cvt (SSE3)
118 bytes for ConvertLUT
8 bytes for FSM (Hutch)

1234 = eax C2D_J
1234 = eax atodw
1234 = eax min2cvt
1234 = eax min1cvt (SSE3)
1234 = eax ConvertLUT
1234 = eax FSM (Hutch)

--- ok ---

Intel(R) Core(TM) i3-10110U CPU @ 2.10GHz (SSE4)

348 cycles for 100 * C2D_J
1080 cycles for 100 * atodw
501 cycles for 100 * min2cvt
323 cycles for 100 * min1cvt (SSE3)
359 cycles for 100 * ConvertLUT
1019 cycles for 100 * FSM (Hutch)

338 cycles for 100 * C2D_J
1051 cycles for 100 * atodw
509 cycles for 100 * min2cvt
366 cycles for 100 * min1cvt (SSE3)
385 cycles for 100 * ConvertLUT
1015 cycles for 100 * FSM (Hutch)

330 cycles for 100 * C2D_J
1119 cycles for 100 * atodw
516 cycles for 100 * min2cvt
326 cycles for 100 * min1cvt (SSE3)
376 cycles for 100 * ConvertLUT
1041 cycles for 100 * FSM (Hutch)

54 bytes for C2D_J
10 bytes for atodw
98 bytes for min2cvt
50 bytes for min1cvt (SSE3)
118 bytes for ConvertLUT
8 bytes for FSM (Hutch)

1234 = eax C2D_J
1234 = eax atodw
1234 = eax min2cvt
1234 = eax min1cvt (SSE3)
1234 = eax ConvertLUT
1234 = eax FSM (Hutch)

--- ok ---
Title: Re: ASM for FUN NEW step #1
Post by: TimoVJL on November 15, 2021, 03:15:12 AM
AMD Ryzen 5 3400G with Radeon Vega Graphics     (SSE4)

536     cycles for 100 * C2D_J
1365    cycles for 100 * atodw
517     cycles for 100 * min2cvt
353     cycles for 100 * min1cvt (SSE3)
615     cycles for 100 * ConvertLUT
2123    cycles for 100 * FSM (Hutch)
618     cycles for 100 * MSVC (Timo)

546     cycles for 100 * C2D_J
1363    cycles for 100 * atodw
516     cycles for 100 * min2cvt
350     cycles for 100 * min1cvt (SSE3)
615     cycles for 100 * ConvertLUT
2152    cycles for 100 * FSM (Hutch)
620     cycles for 100 * MSVC (Timo)

536     cycles for 100 * C2D_J
1371    cycles for 100 * atodw
568     cycles for 100 * min2cvt
350     cycles for 100 * min1cvt (SSE3)
610     cycles for 100 * ConvertLUT
2142    cycles for 100 * FSM (Hutch)
624     cycles for 100 * MSVC (Timo)

54      bytes for C2D_J
10      bytes for atodw
98      bytes for min2cvt
50      bytes for min1cvt (SSE3)
118     bytes for ConvertLUT
8       bytes for FSM (Hutch)
70      bytes for MSVC (Timo)

1234    = eax C2D_J
1234    = eax atodw
1234    = eax min2cvt
1234    = eax min1cvt (SSE3)
1234    = eax ConvertLUT
1234    = eax FSM (Hutch)
1234    = eax MSVC (Timo)
Title: Re: ASM for FUN NEW step #1
Post by: daydreamer on November 15, 2021, 06:16:56 AM
Quote from: jj2007 on November 15, 2021, 12:46:38 AM
Quote from: daydreamer on November 15, 2021, 12:23:28 AMfor ascii to double conversion, I think Raymond approach would be more versatile take care of "1234.5678", but also "123.45678" and "12345.678"

Really? How exactly would you do that?

ConvertLUT:
  mov edx, [eax] ; pointer to string, e.g. "1234"
  and edx, 0f0f0f0fh ; convert "1" to 1
  movzx ecx, dl ;1000s
  mov eax, t1000[ecx*4]
  movzx ecx,dh ;100s
  add eax, t100[ecx*4]
  bswap edx
  movzx ecx,dh ;10s
  add eax, t10[ecx*4]
  movzx ecx, dl ;units
  add eax, ecx
  retn


.data
;integer part
t1000 real4 0.0,1000.0,2000.0,3000.0,4000.0,5000.0,6000.0,7000.0,8000.0,9000.0
t100 real4 0.0,100.0,200.0,300.0,400.0,500.0,600.0,700.0,800.0,900.0
t10 real4 0.0,10.0,20.0,30.0,40.0,50.0,60.0,70.0,80.0,90.0
t1 real4 0.0,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0
;decimals
t0dot1 real4 0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9
t0dot01 real4 0.0,0.01,0.02,0.03,0.04,0.05,0.06,0.07,0.08,0.09
t0dot001 real4 0.0,0.001,0.002,0.003,0.004,0.005,0.006,0.007,0.008,0.009
t0dot0001 real4 0.0,0.0001,0.0002,0.0003,0.0004,0.0005,0.0006,0.0007,0.0008,0.0009
Title: Re: ASM for FUN NEW step #1
Post by: jj2007 on November 15, 2021, 07:22:34 AM
Ok, these are the data, and now the code please, daydreamer :thumbsup:
Title: Re: ASM for FUN NEW step #1
Post by: nidud on November 16, 2021, 03:18:06 AM
deleted
Title: Re: ASM for FUN NEW step #1
Post by: raymond on November 16, 2021, 04:21:42 AM
Quote from: nidud on November 16, 2021, 03:18:06 AM
IMUL is actually a rather fast instruction, even when used with memory operands, so there may not be that much gain coding around it (LEA/LUT). BSWAP however is a relatively slow instruction so that should be avoided if possible.

The qeditor Help section has a listing for opcodes, along with the expected clock cycles for each. The BSWAP one has the following:
QuoteUsage:  BSWAP   reg32

Modifies flags: none

Changes the byte order of a 32 bit register from big endian to
little endian or vice versa.   Result left in destination register
is undefined if the operand is a 16 bit register.

                         Clocks                 Size
Operands         808x  286   386   486          Bytes
reg32                   -       -       -       1             2

0F C8+ rd BSWAP r32 Reverses the byte order of a 32-bit register.

If you consider BSWAP as a "relatively slow instruction", maybe Hutch should review all the tables of that Help section.

Then, in addition, you mention that "IMUL is actually a rather fast instruction", BUT the imul instruction help section in qeditor shows the following:
Quote
                                          Clocks                     Size
Operands                808x    286    386     486         Bytes
reg8                      80-98     13    9-14   13-18          2
reg16                  128-154    21    9-22   13-26          2
reg32                        -         -      9-38   12-42          2
mem8                   86-104    16  12-17   13-18         2-4
mem16               134-160    24  12-25   13-26         2-4
mem32                      -        -    12-41   13-42         2-4
reg16,reg16               -         -     9-22   13-26         3-5
reg32,reg32               -         -     9-38   13-42         3-5
reg16,mem16            -        -     12-25   13-26         3-5
reg32,mem32            -        -     12-41   13-42         3-5
reg16,immed             -       21     9-22   13-26           3
reg32,immed             -       21     9-38   13-42         3-6
reg16,reg16,immed    -        2      9-22   13-26         3-6
reg32,reg32,immed    -      21      9-38   13-42         3-6
reg16,mem16,immed  -     24    12-25    13-26         3-6
reg32,mem32,immed  -     24    12-41    13-42         3-6

Any explanation for this????
Could you provide us with YOUR sources for clock cycles?
Title: Re: ASM for FUN NEW step #1
Post by: jj2007 on November 16, 2021, 04:41:42 AM
Quote from: nidud on November 16, 2021, 03:18:06 AM
IMUL is actually a rather fast instruction, even when used with memory operands, so there may not be that much gain coding around it (LEA/LUT). BSWAP however is a relatively slow instruction so that should be avoided if possible.
...
    86804 cycles 3.asm: pmaddwd
    89518 cycles 1.asm: imul
    94277 cycles 4.asm: pmaddwd+bswap
   108254 cycles 2.asm: imul+bswap[/tt]

Thanks, Nidud. Here are my results:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

546     cycles for 100 * C2D_J
793     cycles for 100 * imul (Nidud)
762     cycles for 100 * min2cvt
479     cycles for 100 * min1cvt (SSE3)
625     cycles for 100 * ConvertLUT
2243    cycles for 100 * FSM (Hutch)
709     cycles for 100 * MSVC (Timo)

554     cycles for 100 * C2D_J
793     cycles for 100 * imul (Nidud)
775     cycles for 100 * min2cvt
476     cycles for 100 * min1cvt (SSE3)
627     cycles for 100 * ConvertLUT
2245    cycles for 100 * FSM (Hutch)
696     cycles for 100 * MSVC (Timo)

546     cycles for 100 * C2D_J
792     cycles for 100 * imul (Nidud)
766     cycles for 100 * min2cvt
475     cycles for 100 * min1cvt (SSE3)
643     cycles for 100 * ConvertLUT
2252    cycles for 100 * FSM (Hutch)
703     cycles for 100 * MSVC (Timo)

54      bytes for C2D_J
58      bytes for imul (Nidud)
98      bytes for min2cvt
50      bytes for min1cvt (SSE3)
122     bytes for ConvertLUT
8       bytes for FSM (Hutch)
70      bytes for MSVC (Timo)

1234    = eax C2D_J
1234    = eax atodw
1234    = eax imul (Nidud)
1234    = eax min2cvt
1234    = eax min1cvt (SSE3)
1234    = eax ConvertLUT
1234    = eax FSM (Hutch)
1234    = eax MSVC (Timo)


Quote from: raymond on November 16, 2021, 04:21:42 AM
Could you provide us with YOUR sources for clock cycles?

I don't know what Nidud's sources are, but here is mine - Agner Fog: one cycle for bswap (https://www.agner.org/optimize/instruction_tables.pdf)
Title: Re: ASM for FUN NEW step #1
Post by: TimoVJL on November 16, 2021, 05:14:19 AM
AMD Ryzen 5 3400G with Radeon Vega Graphics     (SSE4)

537     cycles for 100 * C2D_J
723     cycles for 100 * imul (Nidud)
571     cycles for 100 * min2cvt
339     cycles for 100 * min1cvt (SSE3)
521     cycles for 100 * ConvertLUT
2133    cycles for 100 * FSM (Hutch)
630     cycles for 100 * MSVC (Timo)

542     cycles for 100 * C2D_J
632     cycles for 100 * imul (Nidud)
510     cycles for 100 * min2cvt
340     cycles for 100 * min1cvt (SSE3)
510     cycles for 100 * ConvertLUT
2134    cycles for 100 * FSM (Hutch)
621     cycles for 100 * MSVC (Timo)

537     cycles for 100 * C2D_J
639     cycles for 100 * imul (Nidud)
511     cycles for 100 * min2cvt
341     cycles for 100 * min1cvt (SSE3)
510     cycles for 100 * ConvertLUT
2131    cycles for 100 * FSM (Hutch)
619     cycles for 100 * MSVC (Timo)

54      bytes for C2D_J
58      bytes for imul (Nidud)
98      bytes for min2cvt
50      bytes for min1cvt (SSE3)
122     bytes for ConvertLUT
8       bytes for FSM (Hutch)
70      bytes for MSVC (Timo)

1234    = eax C2D_J
1234    = eax atodw
1234    = eax imul (Nidud)
1234    = eax min2cvt
1234    = eax min1cvt (SSE3)
1234    = eax ConvertLUT
1234    = eax FSM (Hutch)
1234    = eax MSVC (Timo)
Title: Re: ASM for FUN NEW step #1
Post by: nidud on November 16, 2021, 05:39:56 AM
deleted
Title: Re: ASM for FUN NEW step #1
Post by: nidud on November 16, 2021, 06:33:20 AM
deleted
Title: Re: ASM for FUN NEW step #1
Post by: daydreamer on November 16, 2021, 07:45:15 AM
wonder how fast avx2 will be?
Title: Re: ASM for FUN NEW step #1
Post by: jj2007 on November 16, 2021, 08:26:41 AM
Quote from: daydreamer on November 16, 2021, 07:45:15 AM
wonder how fast avx2 will be?

We are still waiting for your code in reply #61, daydreamer.
Title: Re: ASM for FUN NEW step #1
Post by: nidud on November 16, 2021, 08:47:36 AM
deleted
Title: Re: ASM for FUN NEW step #1
Post by: nidud on November 16, 2021, 09:21:23 AM
deleted
Title: Re: ASM for FUN NEW step #1
Post by: Siekmanski on November 16, 2021, 09:52:41 AM
AMD Ryzen 9 5950X 16-Core Processor             (SSE4)

343     cycles for 100 * C2D_J
660     cycles for 100 * imul (Nidud)
449     cycles for 100 * min2cvt
346     cycles for 100 * min1cvt (SSE3)
424     cycles for 100 * ConvertLUT
984     cycles for 100 * FSM (Hutch)
350     cycles for 100 * MSVC (Timo)

343     cycles for 100 * C2D_J
659     cycles for 100 * imul (Nidud)
449     cycles for 100 * min2cvt
347     cycles for 100 * min1cvt (SSE3)
424     cycles for 100 * ConvertLUT
970     cycles for 100 * FSM (Hutch)
546     cycles for 100 * MSVC (Timo)

341     cycles for 100 * C2D_J
662     cycles for 100 * imul (Nidud)
449     cycles for 100 * min2cvt
354     cycles for 100 * min1cvt (SSE3)
412     cycles for 100 * ConvertLUT
969     cycles for 100 * FSM (Hutch)
347     cycles for 100 * MSVC (Timo)

54      bytes for C2D_J
58      bytes for imul (Nidud)
98      bytes for min2cvt
50      bytes for min1cvt (SSE3)
122     bytes for ConvertLUT
8       bytes for FSM (Hutch)
70      bytes for MSVC (Timo)

1234    = eax C2D_J
1234    = eax atodw
1234    = eax imul (Nidud)
1234    = eax min2cvt
1234    = eax min1cvt (SSE3)
1234    = eax ConvertLUT
1234    = eax FSM (Hutch)
1234    = eax MSVC (Timo)

-
Title: Re: ASM for FUN NEW step #1
Post by: Greenhorn on November 16, 2021, 10:27:52 AM
AMD Ryzen 7 3700X 8-Core Processor              (SSE4)

362 cycles for 100 * C2D_J
496 cycles for 100 * imul (Nidud)
365 cycles for 100 * min2cvt
282 cycles for 100 * min1cvt (SSE3)
438 cycles for 100 * ConvertLUT
1093 cycles for 100 * FSM (Hutch)
304 cycles for 100 * MSVC (Timo)

362 cycles for 100 * C2D_J
494 cycles for 100 * imul (Nidud)
365 cycles for 100 * min2cvt
277 cycles for 100 * min1cvt (SSE3)
439 cycles for 100 * ConvertLUT
1092 cycles for 100 * FSM (Hutch)
292 cycles for 100 * MSVC (Timo)

374 cycles for 100 * C2D_J
495 cycles for 100 * imul (Nidud)
364 cycles for 100 * min2cvt
279 cycles for 100 * min1cvt (SSE3)
437 cycles for 100 * ConvertLUT
1100 cycles for 100 * FSM (Hutch)
292 cycles for 100 * MSVC (Timo)

54 bytes for C2D_J
58 bytes for imul (Nidud)
98 bytes for min2cvt
50 bytes for min1cvt (SSE3)
122 bytes for ConvertLUT
8 bytes for FSM (Hutch)
70 bytes for MSVC (Timo)

1234 = eax C2D_J
1234 = eax atodw
1234 = eax imul (Nidud)
1234 = eax min2cvt
1234 = eax min1cvt (SSE3)
1234 = eax ConvertLUT
1234 = eax FSM (Hutch)
1234 = eax MSVC (Timo)

--- ok ---
Title: Re: ASM for FUN NEW step #1
Post by: jj2007 on November 16, 2021, 10:38:23 AM
I opened a new thread called "Timings for bswap, ror, imul, push+pop vs mov [esp+x], nnn, lodsd vs mov eax,.. (http://masm32.com/board/index.php?topic=9638.0)" in the Lab :cool:

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

23      cycles for 100 * imul 10
54      cycles for 100 * lea: *10
4871    cycles for 100 * lodsd (25 DWORDs)
4863    cycles for 100 * mov eax, [esi] + add esi, 4
58      cycles for 100 * lea10, add eax
57      cycles for 100 * lea10, shl eax, 1
27      cycles for 100 * bswap
62      cycles for 100 * ror 16