The MASM Forum

General => The Campus => Topic started by: frktons on November 09, 2021, 09:04:58 AM

Title: ASM for FUN NEW step #1
Post by: frktons on November 09, 2021, 09:04:58 AM

Hello.

I have just created a starting point for validating the algo that converts
several million strings into numbers.

Here is the example I used in PowerBasic, which is considered quite a good compiler,
and the results I got. The ASM versions should start from here and compare their speed
against it.

The Code I used, quite simple:

Quote

SUB Convert
  TIX CycleCount
  FOR Counter = 1 TO Indice
    MyNumbers(Counter) = VAL(MyStrings(Counter))
  NEXT
  TIX END CycleCount
END SUB

The string array holds 7,110,268 strings.
CPU tix used: 2,607,804,194 (roughly 367 tix per string)

Let's see how we can speed it up with some ASM code. :eusa_dance:

The file I use for the test is quite big (> 300 MB). It contains real
S&P 500 quotations from 1997 up to now, at one-minute intervals.

Title: Re: ASM for FUN NEW step #1
Post by: jj2007 on November 09, 2021, 10:53:52 AM

Try the sval("123") macro, it's part of masm32rt.inc:

Code Select

include \masm32\include\masm32rt.inc

.data
x123	db "-12345", 0

.code
start:
  MsgBox 0, cat$(str$(sval(offset x123)), " is the value"), "val is great:", MB_OK
  exit

end start

MasmBasic Val(a string) (https://www.jj2007.eu/MasmBasicQuickReference.htm#Mb1201) is probably a bit slower (it's an allrounder that reads dec, hex, bin in various flavours).

Title: Re: ASM for FUN NEW step #1
Post by: mineiro on November 09, 2021, 01:11:54 PM

This algo only deals with unsigned numbers.

Code Select


;first approach
;"1234" == 1*10^3 + 2*10^2 + 3*10^1 + 4*10^0 == 1000+200+30+4
mov eax,"1234"          ;30303030h to 39393939h
xor eax,"0000"          ;sub "0000"
xor ecx,ecx
xor edx,edx
xor ebx,ebx

shld ecx,eax,8          ;"1"
shl eax,8               ;"234"
lea ebx,[ecx*8]         ;1*8=8
lea ebx,[ebx+ecx*2]     ;8+(1*2)=10

xor ecx,ecx
shld ecx,eax,8          ;"2"
shl eax,8               ;"34"
add ebx,ecx             ;10+2=12
lea edx,[ebx*8]         ;12*8=96
lea edx,[edx+ebx*2]     ;96+(12*2)=120

xor ecx,ecx
xor ebx,ebx
shld ecx,eax,8          ;"3"
shl eax,8               ;"4"
add edx,ecx             ;120+3=123
lea ebx,[edx*8]         ;123*8=984
lea ebx,[ebx+edx*2]     ;984+(123*2)=1230

xor ecx,ecx
shld ecx,eax,8          ;"4"
add ebx,ecx             ;1230+4=1234
;ebx="1234" converted to binary equals to 4d2h

You can create a lookup table with all possible combinations from 0000 to 9999 instead of calculating each number every time. That should give an execution-time gain over 1 million attempts.
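
For readers who find the register juggling hard to follow, here is a minimal C sketch of the same idea, not mineiro's actual code: each step multiplies the running total by 10 using only a shift by 3 plus a shift by 1 (the job of the two lea instructions), then adds the next digit. The function name four_digits_horner is made up for illustration, and the input is assumed to be exactly four ASCII digits.

```c
#include <stdint.h>

/* Hypothetical helper, not from the thread: converts exactly four ASCII
   digits (most significant first) using only shifts and adds. */
uint32_t four_digits_horner(const char *s)
{
    uint32_t acc = 0;
    for (int i = 0; i < 4; i++) {
        uint32_t digit = (uint32_t)(s[i] ^ '0'); /* '0'..'9' -> 0..9, like xor eax,"0000" */
        acc = (acc << 3) + (acc << 1);           /* acc*8 + acc*2 = acc*10, like the two lea */
        acc += digit;
    }
    return acc;
}
```

Calling four_digits_horner("1234") walks through the same intermediate values as the comments above: 1, 12, 123, 1234.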

Title: Re: ASM for FUN NEW step #1
Post by: frktons on November 09, 2021, 01:59:03 PM

Quote from: jj2007 on November 09, 2021, 10:53:52 AM
Try the sval("123") macro, it's part of masm32rt.inc:

Code Select

include \masm32\include\masm32rt.inc

.data
x123	db "-12345", 0

.code
start:
  MsgBox 0, cat$(str$(sval(offset x123)), " is the value"), "val is great:", MB_OK
  exit

end start

MasmBasic Val(a string) (https://www.jj2007.eu/MasmBasicQuickReference.htm#Mb1201) is probably a bit slower (it's an allrounder that reads dec, hex, bin in various flavours).

For didactic purposes I'd like to use just pure and simple ASM opcodes, registers, jumps, moves, adds and so on.
Using a MACRO or a FUNCTION that somebody else has written is fine for testing efficiency and speed,
but it's not fully satisfying for my thirst for ASM knowledge :-)

Moreover I can't find the sval() macro in the MASM32 help, so I can't see what it is and how it works.

Thanks anyway JJ for your suggestions. :-)

Title: Re: ASM for FUN NEW step #1
Post by: frktons on November 09, 2021, 02:00:50 PM

Quote from: mineiro on November 09, 2021, 01:11:54 PM
This algo only deals with unsigned numbers.

Code Select

;first approach
;"1234" == 1*10^3 + 2*10^2 + 3*10^1 + 4*10^0 == 1000+200+30+4
mov eax,"1234"          ;30303030h to 39393939h
xor eax,"0000"          ;sub "0000"
xor ecx,ecx
xor edx,edx
xor ebx,ebx

shld ecx,eax,8          ;"1"
shl eax,8               ;"234"
lea ebx,[ecx*8]         ;1*8=8
lea ebx,[ebx+ecx*2]     ;8+(1*2)=10

xor ecx,ecx
shld ecx,eax,8          ;"2"
shl eax,8               ;"34"
add ebx,ecx             ;10+2=12
lea edx,[ebx*8]         ;12*8=96
lea edx,[edx+ebx*2]     ;96+(12*2)=120

xor ecx,ecx
xor ebx,ebx
shld ecx,eax,8          ;"3"
shl eax,8               ;"4"
add edx,ecx             ;120+3=123
lea ebx,[edx*8]         ;123*8=984
lea ebx,[ebx+edx*2]     ;984+(123*2)=1230

xor ecx,ecx
shld ecx,eax,8          ;"4"
add ebx,ecx             ;1230+4=1234
;ebx="1234" converted to binary equals to 4d2h

You can create a lookup table with all possible combinations from 0000 to 9999 instead of calculating each number every time. That should give an execution-time gain over 1 million attempts.

Yes Mineiro, this is the kind of reasoning I'm doing myself.
I just need time to regain my ability to use the MASM opcodes.

My data set has no negative numbers; all of them are unsigned.

Thanks

Title: Re: ASM for FUN NEW step #1
Post by: frktons on November 09, 2021, 02:24:00 PM

I am thinking about an ASCII approach to the conversion from string to number.
If I have the string number "4735", every digit has an ASCII representation, and value:

4 is ASCII 52
7 is ASCII 55
3 is ASCII 51
5 is ASCII 53

The four bytes in memory have these ASCII codes 52,55,51,53.

If I move these 4 bytes into four registers (EAX, EBX, ECX, EDX)
and subtract 48 from each, I get the values

4 in EAX
7 in EBX
3 in ECX
5 in EDX

EAX contains the thousands (4) so its value has to be multiplied by 1000
EBX contains the hundreds (7) so its value has to be multiplied by 100
ECX contains the tens (3) so its value has to be multiplied by 10
EDX contains the units

The sum of the contents of the 4 registers is the number that is represented in the ASCII string.

I don't know if this approach is efficient for speed, but the logic should work.

If there are MMX or SSE opcodes that do the work on all four bytes/registers in parallel,
the efficiency should be better. But at the moment I don't know whether such opcodes exist.
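
The logic above can be checked with a minimal C sketch. The variable names stand in for the four registers, the function name four_digits_weighted is invented for illustration, and the input is assumed to hold exactly four ASCII digits.

```c
#include <stdint.h>

/* Hypothetical helper: one variable per digit, playing the role of
   EAX, EBX, ECX and EDX in the description above. */
uint32_t four_digits_weighted(const char *s)
{
    uint32_t thousands = (uint32_t)(s[0] - 48); /* ASCII 52 -> 4 */
    uint32_t hundreds  = (uint32_t)(s[1] - 48); /* ASCII 55 -> 7 */
    uint32_t tens      = (uint32_t)(s[2] - 48); /* ASCII 51 -> 3 */
    uint32_t units     = (uint32_t)(s[3] - 48); /* ASCII 53 -> 5 */
    return thousands * 1000 + hundreds * 100 + tens * 10 + units;
}
```

For "4735" this adds 4000 + 700 + 30 + 5.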

Title: Re: ASM for FUN NEW step #1
Post by: daydreamer on November 09, 2021, 09:41:54 PM

Quote from: frktons on November 09, 2021, 02:24:00 PM
I am thinking about an ASCII approach to the conversion from string to number.
If I have the string number "4735", every digit has an ASCII representation, and value:

4 is ASCII 52
7 is ASCII 55
3 is ASCII 51
5 is ASCII 53

The four bytes in memory have these ASCII codes 52,55,51,53.

If I move these 4 bytes into four registers (EAX, EBX, ECX, EDX)
and subtract 48 from each, I get the values

4 in EAX
7 in EBX
3 in ECX
5 in EDX

EAX contains the thousands (4) so its value has to be multiplied by 1000
EBX contains the hundreds (7) so its value has to be multiplied by 100
ECX contains the tens (3) so its value has to be multiplied by 10
EDX contains the units

The sum of the contents of the 4 registers is the number that is represented in the ASCII string.

I don't know if this approach is efficient for speed, but the logic should work.

If there are MMX or SSE opcodes that do the work on all four bytes/registers in parallel,
the efficiency should be better. But at the moment I don't know whether such opcodes exist.

You can also get ASCII numbers like " 123" instead of "0123", so I prefer AND to mask out the 32+16 bits instead of SUB.
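
A small C sketch of the difference, with made-up function names: subtracting 48 turns a space (ASCII 32) into -16, while masking with 0Fh clears the 32 and 16 bits, so a space becomes 0 just like "0" does. Note that the mask accepts any character silently (":" would become 10), so this assumes the input holds only digits and leading spaces.

```c
#include <stdint.h>

/* Hypothetical helpers comparing the two ways of stripping the ASCII bias. */
int digit_by_sub(char c) { return c - '0'; }  /* ' ' (32) gives -16 */
int digit_by_and(char c) { return c & 0x0F; } /* ' ' (32) gives 0   */

/* Four characters, digits or leading spaces, most significant first. */
uint32_t four_chars_masked(const char *s)
{
    return (uint32_t)digit_by_and(s[0]) * 1000
         + (uint32_t)digit_by_and(s[1]) * 100
         + (uint32_t)digit_by_and(s[2]) * 10
         + (uint32_t)digit_by_and(s[3]);
}
```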

Title: Re: ASM for FUN NEW step #1
Post by: frktons on November 09, 2021, 09:48:13 PM

Yes daydreamer, I think it is an efficient way of getting the same result.

I am not used to using these tricks, and my ASM is a bit rusty from long inactivity.

But I understand the point. Thanks

In the data set I am using there are no spaces; numbers would get a leading "0" if needed.
In fact there are no leading zeros at all: all the values are between 1000 and 5000.

:thumbsup:

Title: Re: ASM for FUN NEW step #1
Post by: jj2007 on November 09, 2021, 10:13:07 PM

Quote from: frktons on November 09, 2021, 02:24:00 PM
EAX contains the thousands (4) so its value has to be multiplied by 1000
EBX contains the hundreds (7) so its value has to be multiplied by 100
ECX contains the tens (3) so its value has to be multiplied by 10
EDX contains the units

Code Select

include \masm32\include\masm32rt.inc

.data
x123	db "1234", 0

.code
start:
  mov ecx, dword ptr x123
  movzx eax, cl		; "1"
  and eax, 15		; 1
  imul eax, eax, 1000	; 1000
  movzx edx, ch		; "2"
  and edx, 15		; 2
  imul edx, edx, 100	; 200
  add eax, edx		; 1200
  bswap ecx
  movzx edx, ch		; "3"
  and edx, 15		; 3
  imul edx, edx, 10	; 30
  add eax, edx		; 1230
  and ecx, 15		; "4" -> 4
  add eax, ecx		; 1234
  MsgBox 0, cat$(str$(eax), " is the value"), "That was simple:", MB_OK
  exit
end start

Title: Re: ASM for FUN NEW step #1
Post by: frktons on November 09, 2021, 10:18:52 PM

Hi JJ.

Nice solution using only one register and its one-byte parts.

Great idea, compliments my friend.

I was thinking about something similar, but I was wondering whether full 32-bit registers are faster
for these operations. Maybe not. I need to check it out.

:thumbsup:

Title: Re: ASM for FUN NEW step #1
Post by: hutch-- on November 09, 2021, 11:32:05 PM

What about a lookup table with 0000 to 9999 members ? Only 40k.

Title: Re: ASM for FUN NEW step #1
Post by: mineiro on November 10, 2021, 01:15:12 AM

This is a variation from daydreamer code.

Code Select


.data
align 8
total dw 1,10,100,1000
.code
movq xmm1,qword ptr [total]   ;movq loads all 8 bytes (movd takes only 32 bits)
pxor xmm2,xmm2

mov eax,"1234"      ;mov eax,dword ptr []
                    ;bswap eax
xor eax,"0000"      ;and eax,0f0f0f0fh
movd xmm0,eax       ;01020304h
punpcklbw xmm0,xmm2 ;0001000200030004h
pmaddwd xmm0,xmm1   ;mul and add
phaddd xmm0,xmm0    ;add
movd eax,xmm0       ;eax=4d2h
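
To make the two SSE steps easier to follow, here is a scalar C model of what pmaddwd and phaddd compute on the four digit lanes. It is only a sketch of the arithmetic, not SIMD code, and the name four_digits_madd_model is invented. Lane 0 holds the units digit, as in the register layout shown in the comments above.

```c
#include <stdint.h>

/* Hypothetical scalar model of the pmaddwd + phaddd sequence. */
uint32_t four_digits_madd_model(const char *s)
{
    /* Digits widened to 16-bit lanes, units digit in lane 0
       (the state after the mask and punpcklbw). */
    int16_t lanes[4] = {
        (int16_t)(s[3] & 0x0F), (int16_t)(s[2] & 0x0F),
        (int16_t)(s[1] & 0x0F), (int16_t)(s[0] & 0x0F)
    };
    int16_t weights[4] = { 1, 10, 100, 1000 }; /* the "total" table */

    /* pmaddwd: multiply lane by lane, add adjacent products into 32-bit sums */
    int32_t pair0 = lanes[0] * weights[0] + lanes[1] * weights[1];
    int32_t pair1 = lanes[2] * weights[2] + lanes[3] * weights[3];

    /* phaddd: horizontal add of the two 32-bit sums */
    return (uint32_t)(pair0 + pair1);
}
```

For "1234" the two partial sums are 34 and 1200.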

Title: Re: ASM for FUN NEW step #1
Post by: jj2007 on November 10, 2021, 02:33:45 AM

Quote from: hutch-- on November 09, 2021, 11:32:05 PM
What about a lookup table with 0000 to 9999 members ? Only 40k.

How would you look up, for example, "1234"? Conversion is from Ascii to binary...

Title: Re: ASM for FUN NEW step #1
Post by: hutch-- on November 10, 2021, 06:32:03 AM

Simple, make a dedicated integer conversion that deals with exactly 4 characters, then use the result as an index into the array. What is in the array ? Anything you like.

Title: Re: ASM for FUN NEW step #1
Post by: daydreamer on November 10, 2021, 06:44:17 AM

Quote from: mineiro on November 10, 2021, 01:15:12 AM
This is a variation from daydreamer code.
Code Select

.data
align 8
total dw 1,10,100,1000
.code
movq xmm1,qword ptr [total]
pxor xmm2,xmm2

mov eax,"1234"      ;mov eax,dword ptr []
                    ;bswap eax
xor eax,"0000"      ;and eax,0f0f0f0fh
movd xmm0,eax       ;01020304h
punpcklbw xmm0,xmm2 ;0001000200030004h
pmaddwd xmm0,xmm1   ;mul and add
phaddd xmm0,xmm0    ;add
movd eax,xmm0       ;eax=4d2h

Great :thumbsup:
Actually my code was just a first step to get it working right; after that I want to do ascii->double and double->ascii, and possibly 128-bit and 256-bit versions.
@Hutch
A LUT works fast. How fast? There is LUT-based code somewhere in the forum that checks whether a number among the first 1-65535 is a prime.

Title: Re: ASM for FUN NEW step #1
Post by: jj2007 on November 10, 2021, 07:59:05 AM

Quote from: hutch-- on November 10, 2021, 06:32:03 AM
Simple, make a dedicated integer conversion that deals with exactly 4 characters, then use the result as an index into the array. What is in the array ? Anything you like.

This "dedicated integer conversion" looks suspiciously like the method I used in reply #8. Only that no array is needed after you got the index...

Let's do some timings, just for fun? :biggrin:

Code Select

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

469     cycles for 100 * C2D_J
1594    cycles for 100 * atodw
21894   cycles for 100 * sscanf

485     cycles for 100 * C2D_J
1582    cycles for 100 * atodw
22022   cycles for 100 * sscanf

469     cycles for 100 * C2D_J
1590    cycles for 100 * atodw
21863   cycles for 100 * sscanf

476     cycles for 100 * C2D_J
1573    cycles for 100 * atodw
21894   cycles for 100 * sscanf

50      bytes for C2D_J
10      bytes for atodw
22      bytes for sscanf

1234    = eax C2D_J
1234    = eax atodw
1234    = eax sscanf

Title: Re: ASM for FUN NEW step #1
Post by: mineiro on November 10, 2021, 09:06:41 AM

Quote from: jj2007 on November 10, 2021, 07:59:05 AM
This "dedicated integer conversion" looks suspiciously like the method I used in reply #8. Only that no array is needed after you got the index...

I was thinking something like this:

Code Select


lea esi,lookup_table
mov edx,0f0f0f0fh       ;mask, lower bits of each byte

mov ecx,"1234"          ;source
pext eax,ecx,edx        ;index=eax=00001234h, 64k lookup table *2 (element sizeof)    ;BMI2 cpuid
movzx ebx,word ptr [esi+eax*2]      ;ebx=converted value
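
A portable C sketch of this table idea, under stated assumptions: pack_nibbles is a stand-in for pext with the mask 0f0f0f0fh, the table and function names are invented, and the digits are packed in reading order. A real pext on a dword loaded from memory on little-endian x86 would yield the nibbles in the opposite order (4321h for "1234"), so the table would then have to be built for that order, or a bswap added.

```c
#include <stdint.h>

/* Hypothetical 64K-entry table indexed by the four packed digit nibbles. */
static uint16_t bcd_table[65536];

/* Fill only the 10,000 entries whose index is valid packed decimal. */
void bcd_table_init(void)
{
    for (uint32_t v = 0; v < 10000; v++) {
        uint32_t idx = ((v / 1000) << 12) | ((v / 100 % 10) << 8)
                     | ((v / 10 % 10) << 4) | (v % 10);
        bcd_table[idx] = (uint16_t)v;
    }
}

/* Portable stand-in for pext: keep the low nibble of each character. */
uint32_t pack_nibbles(const char *s)
{
    return ((uint32_t)(s[0] & 0x0F) << 12) | ((uint32_t)(s[1] & 0x0F) << 8)
         | ((uint32_t)(s[2] & 0x0F) << 4)  |  (uint32_t)(s[3] & 0x0F);
}

uint32_t four_digits_bcd_lut(const char *s)
{
    return bcd_table[pack_nibbles(s)];
}
```

The table must be initialized once with bcd_table_init() before the first lookup.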

Title: Re: ASM for FUN NEW step #1
Post by: frktons on November 10, 2021, 09:22:36 AM

I tried a solution similar to the one Steve suggested, but with a different logic and shape.
I used four arrays of 64 elements each, indexed directly by ASCII code. In the Thousands array,
the element at index 49 (the ASCII code for "1") is initialized with the value 1000, and so on
for the other digits up to "9".
The first array is for Thousands, the second for Hundreds, then Tens and Units.

The conversion from "1234" to its corresponding number is just the sum Thousands(ASCII code of first digit) +
Hundreds(ASCII code of second digit) + Tens(ASCII code of third digit) + Units(ASCII code of fourth digit).

I tried it in PowerBasic, and it is 30% faster than the standard VAL() function.

I suppose in ASM it can be much faster because we can use registers, shifts, and ADD to get the
result from arrays holding the values we need: 1-9, 10-90, 100-900, 1000-9000.
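
The four-array scheme described above could look like this in C. This is only a sketch: the array and function names are invented, and it assumes the input contains nothing but the characters "0" to "9", since any other character would index outside the meaningful entries or outside the 64-element arrays.

```c
#include <stdint.h>

/* Hypothetical tables indexed directly by ASCII code (48..57 fit in 64 slots). */
static uint16_t thousands[64], hundreds[64], tens[64], units[64];

void ascii_tables_init(void)
{
    for (int d = 0; d <= 9; d++) {
        int code = '0' + d;                  /* ASCII 48..57 */
        thousands[code] = (uint16_t)(d * 1000);
        hundreds[code]  = (uint16_t)(d * 100);
        tens[code]      = (uint16_t)(d * 10);
        units[code]     = (uint16_t)d;
    }
}

/* No subtraction, masking or multiplication: four lookups and three adds. */
uint32_t four_digits_ascii_lut(const char *s)
{
    return (uint32_t)thousands[(unsigned char)s[0]]
         + hundreds[(unsigned char)s[1]]
         + tens[(unsigned char)s[2]]
         + units[(unsigned char)s[3]];
}
```

The tables are filled once with ascii_tables_init(); after that each conversion touches only the four entries selected by the characters themselves.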

Thanks all for your suggestions and inspiration. The MASM Forum is a great place to be.

:thumbsup:

Title: Re: ASM for FUN NEW step #1
Post by: frktons on November 10, 2021, 10:26:30 AM

Quote from: jj2007 on November 09, 2021, 10:13:07 PM
Quote from: frktons on November 09, 2021, 02:24:00 PM
EAX contains the thousands (4) so its value has to be multiplied by 1000
EBX contains the hundreds (7) so its value has to be multiplied by 100
ECX contains the tens (3) so its value has to be multiplied by 10
EDX contains the units

Code Select

include \masm32\include\masm32rt.inc

.data
x123	db "1234", 0

.code
start:
  mov ecx, dword ptr x123
  movzx eax, cl		; "1"
  and eax, 15		; 1
  imul eax, eax, 1000	; 1000
  movzx edx, ch		; "2"
  and edx, 15		; 2
  imul edx, edx, 100	; 200
  add eax, edx		; 1200
  bswap ecx
  movzx edx, ch		; "3"
  and edx, 15		; 3
  imul edx, edx, 10	; 30
  add eax, edx		; 1230
  and ecx, 15		; "4" -> 4
  add eax, ecx		; 1234
  MsgBox 0, cat$(str$(eax), " is the value"), "That was simple:", MB_OK
  exit
end start

Hi JJ.

The solution you suggested is quite simple and elegant.
Pure 32 bit ASM code.
I suppose that the code:
AND EAX, 15 is the equivalent of SUB EAX, 48 (maybe faster?)

If we have arrays with all the round thousands, hundreds and tens,
use the ASCII codes directly as indexes into these arrays,
and at the end just add the four elements corresponding to
the four ASCII codes, could we get a faster result?

I think IMUL is slower than ADD, and we wouldn't need the AND REG, 15.

What do you think, my friend?

Enjoy

Title: Re: ASM for FUN NEW step #1
Post by: jj2007 on November 10, 2021, 12:59:38 PM

Hi mineiro & Frank,

Attached a version that you can use for your algos. TestD and TestE are free - just insert your code as demonstrated below the TestA_s: label.

No MasmBasic required, just the plain Masm32 SDK. Enjoy :thumbsup:

Title: Re: ASM for FUN NEW step #1
Post by: mineiro on November 10, 2021, 01:23:05 PM

Thanks sir jj2007;
Results follow. Please note that the reported code size of my procedure is wrong; it is bigger than it appears:

Code Select


~/.wine/drive_c/FourCharsToDword2v1$ wine FourCharsToDword.exe 
Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz (SSE4)

449	cycles for 100 * C2D_J
1499	cycles for 100 * atodw
12673	cycles for 100 * sscanf
342	cycles for 100 * mineiro
??	cycles for 100 * testE

442	cycles for 100 * C2D_J
1523	cycles for 100 * atodw
12670	cycles for 100 * sscanf
342	cycles for 100 * mineiro
??	cycles for 100 * testE

435	cycles for 100 * C2D_J
1514	cycles for 100 * atodw
12673	cycles for 100 * sscanf
344	cycles for 100 * mineiro
??	cycles for 100 * testE

435	cycles for 100 * C2D_J
1511	cycles for 100 * atodw
12891	cycles for 100 * sscanf
329	cycles for 100 * mineiro
??	cycles for 100 * testE

50	bytes for C2D_J
10	bytes for atodw
22	bytes for sscanf
46	bytes for mineiro
2	bytes for testE

1234	= eax C2D_J
1234	= eax atodw
1234	= eax sscanf
1234	= eax mineiro
2	= eax testE

Title: Re: ASM for FUN NEW step #1
Post by: frktons on November 10, 2021, 04:44:59 PM

Hi JJ, Mineiro. :eusa_clap:

As expected, with SSE/SSE2 and more advanced opcodes we can get great results
when we work on multiple registers in parallel.

It will take me time to translate my ideas into ASM, but I presume the results could
be interesting.

Ad maiora.

:thumbsup:

Title: Re: ASM for FUN NEW step #1
Post by: jj2007 on November 10, 2021, 07:23:10 PM

Quote from: mineiro on November 10, 2021, 01:23:05 PM
Results follow. Please note that the reported code size of my procedure is wrong; it is bigger than it appears:

Very nice! Code size is calculated including the loop & call part but not including data, such as the 8 bytes of your total variable. On my old i5 your code is a tick slower than mine, but I see it's much faster on your i7 :thumbsup:

This should actually go into the proc, not before the loop, in order to be comparable to my algo; it does not influence speed, though:

Code Select

movq xmm1,qword ptr [total]	;static values
pxor xmm2,xmm2

Code Select

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

495     cycles for 100 * C2D_J
1640    cycles for 100 * atodw
21981   cycles for 100 * sscanf
544     cycles for 100 * mineiro

469     cycles for 100 * C2D_J
1578    cycles for 100 * atodw
21924   cycles for 100 * sscanf
550     cycles for 100 * mineiro

471     cycles for 100 * C2D_J
1579    cycles for 100 * atodw
21940   cycles for 100 * sscanf
545     cycles for 100 * mineiro

469     cycles for 100 * C2D_J
1577    cycles for 100 * atodw
22206   cycles for 100 * sscanf
545     cycles for 100 * mineiro

50      bytes for C2D_J
10      bytes for atodw
22      bytes for sscanf
54      bytes for mineiro

1234    = eax C2D_J
1234    = eax atodw
1234    = eax sscanf
1234    = eax mineiro

P.S.: There is a useE=1 at the beginning of the file. Set useE=0 to suppress the unused TestE loop.

Title: Re: ASM for FUN NEW step #1
Post by: mineiro on November 10, 2021, 09:04:43 PM

Quote from: jj2007 on November 10, 2021, 07:23:10 PM
This should actually go into the proc, not before the loop, in order to be comparable to my algo; it does not influence speed, though:
Code Select

movq xmm1,qword ptr [total]	;static values
pxor xmm2,xmm2

With that code inserted inside the procedure, I got a value very close to the one you posted before.
I have tried to favor parallelization and passed 2 values to be converted in a single call (eax and edx registers); it seems to work.

Code Select


424	cycles for 100 * C2D_J
1504	cycles for 100 * atodw
12760	cycles for 100 * sscanf
652	cycles for 100 * mineiro   ;;converting 2 values simultaneously, all code inside procedure

I remember that in the past, when we competed with code, we avoided mul/imul in favor of lea. I have tried the "first approach" posted in this topic, but performance was not good on my PC (always preserving register usage (ebx)): something like 700 ~ 720 cycles.

So, let's wait for the next round, sir jj2007.

Title: Re: ASM for FUN NEW step #1
Post by: jj2007 on November 10, 2021, 10:40:38 PM

Quote from: mineiro on November 10, 2021, 09:04:43 PM
I have tried to favor parallelization and passed 2 values to be converted in a single call (eax and edx registers); it seems to work.

I'm not quite sure what your algo does. The other algos get a string pointer as argument:

Code Select

	push eax
	invoke crt_sscanf, chr$("1234"), chr$("%d"), esp
	pop eax

Title: Re: ASM for FUN NEW step #1
Post by: mineiro on November 10, 2021, 11:17:07 PM

Quote from: jj2007 on November 10, 2021, 10:40:38 PM
I'm not quite sure what your algo does. The other algos get a string pointer as argument:

Well, then we need to define some rules.
Your algo does the same as mine, only with the string reversed.

Code Select


	mov eax, "4321"
	call C2D_J

Here follow the algo that can perform 2 conversions (min2cvt) and the one that uses lea.
I will edit this message and post results from another computer:

Code Select


Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz (SSE4)

553	cycles for 100 * C2D_J
1487	cycles for 100 * atodw
12890	cycles for 100 * sscanf
739	cycles for 100 * min2cvt
333	cycles for 100 * min1cvt
895	cycles for 100 * min_lea

562	cycles for 100 * C2D_J
1485	cycles for 100 * atodw
12871	cycles for 100 * sscanf
743	cycles for 100 * min2cvt
333	cycles for 100 * min1cvt
895	cycles for 100 * min_lea

566	cycles for 100 * C2D_J
1486	cycles for 100 * atodw
12900	cycles for 100 * sscanf
746	cycles for 100 * min2cvt
332	cycles for 100 * min1cvt
751	cycles for 100 * min_lea

561	cycles for 100 * C2D_J
1559	cycles for 100 * atodw
12603	cycles for 100 * sscanf
742	cycles for 100 * min2cvt
332	cycles for 100 * min1cvt
895	cycles for 100 * min_lea

50	bytes for C2D_J
10	bytes for atodw
22	bytes for sscanf
77	bytes for min2cvt
44	bytes for min1cvt
92	bytes for min_lea

1234	= eax C2D_J
1234	= eax atodw
1234	= eax sscanf
1234	= eax min2cvt
1234	= eax min1cvt
1234	= eax min_lea

Code Select


wine FourCharsToDword.exe 
Intel(R) Core(TM) i3-3110M CPU @ 2.40GHz (SSE4)

561	cycles for 100 * C2D_J
1912	cycles for 100 * atodw
19520	cycles for 100 * sscanf
672	cycles for 100 * min2cvt
458	cycles for 100 * min1cvt
765	cycles for 100 * min_lea

568	cycles for 100 * C2D_J
1919	cycles for 100 * atodw
19520	cycles for 100 * sscanf
670	cycles for 100 * min2cvt
457	cycles for 100 * min1cvt
767	cycles for 100 * min_lea

561	cycles for 100 * C2D_J
1918	cycles for 100 * atodw
19488	cycles for 100 * sscanf
672	cycles for 100 * min2cvt
456	cycles for 100 * min1cvt
767	cycles for 100 * min_lea

571	cycles for 100 * C2D_J
1912	cycles for 100 * atodw
19498	cycles for 100 * sscanf
671	cycles for 100 * min2cvt
458	cycles for 100 * min1cvt
769	cycles for 100 * min_lea

50	bytes for C2D_J
10	bytes for atodw
22	bytes for sscanf
77	bytes for min2cvt
44	bytes for min1cvt
92	bytes for min_lea

1234	= eax C2D_J
1234	= eax atodw
1234	= eax sscanf
1234	= eax min2cvt
1234	= eax min1cvt
1234	= eax min_lea

Title: Re: ASM for FUN NEW step #1
Post by: mineiro on November 11, 2021, 12:32:30 AM

I changed my previous code to take a string pointer as argument.
The previous cycle counts did not change.
I removed the lea algo.

Title: Re: ASM for FUN NEW step #1
Post by: jj2007 on November 11, 2021, 01:34:32 AM

Quote from: mineiro on November 11, 2021, 12:32:30 AM
I changed my previous code to take a string pointer as argument

I'm afraid you are not passing a string pointer:

Code Select

		mov eax, "4321"		;first value to be converted
		mov edx, "1234"		;second value to be converted

mov eax, chr$("1234") would pass a string pointer, as in

Code Select

  .Repeat
	push eax
	invoke crt_sscanf, chr$("1234"), chr$("%d"), esp
	pop eax
	dec ebx
  .Until Sign?

Title: Re: ASM for FUN NEW step #1
Post by: mineiro on November 11, 2021, 07:08:29 AM

Quote from: jj2007 on November 11, 2021, 01:34:32 AM
I'm afraid you are not passing a string pointer:
Code Select

		mov eax, "4321"		;first value to be converted
		mov edx, "1234"		;second value to be converted

mov eax, chr$("1234") would pass a string pointer, as in

My procedure and yours pass data loaded directly into the eax register, while the other functions (atodw/crt_sscanf) receive a data pointer on the stack.
RULES:
All procedures should receive a pointer to the data on the stack, stdcall.
I changed C2D_J, min2cvt and min1cvt.

Code Select


wine FourCharsToDword.exe 
Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz (SSE4)

557	cycles for 100 * C2D_J
1524	cycles for 100 * atodw
12665	cycles for 100 * sscanf
767	cycles for 100 * min2cvt
429	cycles for 100 * min1cvt

579	cycles for 100 * C2D_J
1499	cycles for 100 * atodw
12649	cycles for 100 * sscanf
785	cycles for 100 * min2cvt
429	cycles for 100 * min1cvt

602	cycles for 100 * C2D_J
1496	cycles for 100 * atodw
12631	cycles for 100 * sscanf
783	cycles for 100 * min2cvt
429	cycles for 100 * min1cvt

591	cycles for 100 * C2D_J
1497	cycles for 100 * atodw
12631	cycles for 100 * sscanf
747	cycles for 100 * min2cvt
431	cycles for 100 * min1cvt

54	bytes for C2D_J
10	bytes for atodw
22	bytes for sscanf
84	bytes for min2cvt
52	bytes for min1cvt

1234	= eax C2D_J
1234	= eax atodw
1234	= eax sscanf
1234	= eax min2cvt
1234	= eax min1cvt

Title: Re: ASM for FUN NEW step #1
Post by: jj2007 on November 11, 2021, 11:38:43 AM

Excellent, mineiro :thumbsup:

Code Select

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

467     cycles for 100 * C2D_J
1592    cycles for 100 * atodw
21900   cycles for 100 * sscanf
761     cycles for 100 * min2cvt
543     cycles for 100 * min1cvt

467     cycles for 100 * C2D_J
1588    cycles for 100 * atodw
21908   cycles for 100 * sscanf
763     cycles for 100 * min2cvt
543     cycles for 100 * min1cvt

467     cycles for 100 * C2D_J
1600    cycles for 100 * atodw
21898   cycles for 100 * sscanf
768     cycles for 100 * min2cvt
549     cycles for 100 * min1cvt

468     cycles for 100 * C2D_J
1591    cycles for 100 * atodw
21924   cycles for 100 * sscanf
767     cycles for 100 * min2cvt
545     cycles for 100 * min1cvt

50      bytes for C2D_J
10      bytes for atodw
22      bytes for sscanf
98      bytes for min2cvt
62      bytes for min1cvt

1234    = eax C2D_J
1234    = eax atodw
1234    = eax sscanf
1234    = eax min2cvt
1234    = eax min1cvt

Title: Re: ASM for FUN NEW step #1
Post by: raymond on November 11, 2021, 03:18:06 PM

I would think that using Look-up tables for converting numbers from ascii format to binary may be the fastest option. I don't have the facilities to compare various procedures but someone else may want to test it.
I wrote some code to prepare such a LUT (done once, which should take only a few nanosecs), then used it to convert an ASCII number 1,000,000,000 times (10^9). The timing on my older laptop was 1.5 secs, i.e. about 1.5 nanosecs per conversion!

Code for preparing LUT

Code Select

.data
      t10   dd    10 DUP(0)
      t100  dd    10 DUP(0)
      t1000 dd    10 DUP(0)

      numbers     db    "3578"

.code
start:

;fill LUTs of multiples
;of 10
      lea   edi,t10
      xor   eax,eax
      mov   edx,10
      mov   ecx,10
   @@:
      stosd
      add   eax,edx
      dec   ecx
      jnz   @B

;of 100 (edi now points at t100, which directly follows t10 in .data)
      xor   eax,eax
      mov   edx,100
      mov   ecx,10
   @@:
      stosd
      add   eax,edx
      dec   ecx
      jnz   @B

;of 1000 (edi now points at t1000, which directly follows t100)
      xor   eax,eax
      mov   edx,1000
      mov   ecx,10
   @@:
      stosd
      add   eax,edx
      dec   ecx
      jnz   @B

;there is NO need to have a look-up table for multiples of units!!!

Then I ran the following code for converting ascii to binary, using the QueryPerformanceFrequency function for the timing

Code Select

      mov   ecx,1000000000
start1:

      lea   esi,numbers
      mov   edx,[esi]
      and   edx,0f0f0f0fh
      movzx ebx,dl            ;1000s
      mov   eax,t1000[ebx*4]
      movzx ebx,dh            ;100s
      add   eax,t100[ebx*4]
      bswap edx
      movzx ebx,dh            ;10s
      add   eax,t10[ebx*4]
      movzx ebx,dl            ;units
      add   eax,ebx

      dec   ecx
      jnz   start1

Title: Re: ASM for FUN NEW step #1
Post by: TimoVJL on November 11, 2021, 06:13:44 PM

Code Select

AMD Athlon(tm) II X2 220 Processor (SSE3)

833     cycles for 100 * C2D_J
2410    cycles for 100 * atodw
34220   cycles for 100 * sscanf
1296    cycles for 100 * min2cvt
1057    cycles for 100 * min1cvt

806     cycles for 100 * C2D_J
2437    cycles for 100 * atodw
34498   cycles for 100 * sscanf
1297    cycles for 100 * min2cvt
1053    cycles for 100 * min1cvt

805     cycles for 100 * C2D_J
2436    cycles for 100 * atodw
34558   cycles for 100 * sscanf
1297    cycles for 100 * min2cvt
1055    cycles for 100 * min1cvt

805     cycles for 100 * C2D_J
2410    cycles for 100 * atodw
34503   cycles for 100 * sscanf
1296    cycles for 100 * min2cvt
1060    cycles for 100 * min1cvt

50      bytes for C2D_J
10      bytes for atodw
22      bytes for sscanf
98      bytes for min2cvt
62      bytes for min1cvt

1234    = eax C2D_J
1234    = eax atodw
1234    = eax sscanf
1234    = eax min2cvt
1234    = eax min1cvt

--- ok ---

LUT with Pelles C

Code Select

1: #include <stdio.h>
2: int a10[] = {0,10,20,30,40,50,60,70,80,90};
3: int a100[] = {0,100,200,300,400,500,600,700,800,900};
4: int a1000[] = {0,1000,2000,3000,4000,5000,6000,7000,8000,9000};
5: 
6: int __cdecl main(void)

_main:
  [00000000] 55                     push              ebp
  [00000001] 89E5                   mov               ebp,esp
  [00000003] 83EC08                 sub               esp,8
7: {
8: 	char anum[] = "1234";
  [00000006] C745FB31323334         mov               dword ptr [ebp-5],34333231
  [0000000D] C645FF00               mov               byte ptr [ebp-1],0
9: 	int num;
10: 
11: 	(*(int*)&anum) &= 0x0f0f0f0f;
  [00000011] 8165FB0F0F0F0F         and               dword ptr [ebp-5],F0F0F0F
12: 	num = a1000[anum[0]] + a100[anum[1]] + a10[anum[2]] + anum[3];
  [00000018] 0FBE45FB               movsx             eax,byte ptr [ebp-5]
  [0000001C] 8B048500000000         mov               eax,dword ptr [eax*4+_a1000]
  [00000023] 0FBE55FC               movsx             edx,byte ptr [ebp-4]
  [00000027] 03049500000000         add               eax,dword ptr [edx*4+_a100]
  [0000002E] 0FBE55FD               movsx             edx,byte ptr [ebp-3]
  [00000032] 03049500000000         add               eax,dword ptr [edx*4+_a10]
  [00000039] 0FBE55FE               movsx             edx,byte ptr [ebp-2]
  [0000003D] 01D0                   add               eax,edx
13: 	printf("%u\n", num);
  [0000003F] 50                     push              eax
  [00000040] 6800000000             push              @6
  [00000045] E800000000             call              _printf
  [0000004A] 83C408                 add               esp,8
14: 	return 0;
  [0000004D] 31C0                   xor               eax,eax
  [0000004F] 89EC                   mov               esp,ebp
  [00000051] 5D                     pop               ebp
  [00000052] C3                     ret

using temporary string

Code Select


	char anum2[] = "1234";
	char anum[5];
	int num;

12: 	(*(int*)&anum) = (*(int*)&anum2) & 0x0f0f0f0f;
  [00000011] 8B45FB                 mov               eax,dword ptr [ebp-5]
  [00000014] 250F0F0F0F             and               eax,F0F0F0F
  [00000019] 8945F6                 mov               dword ptr [ebp-A],eax

Title: Re: ASM for FUN NEW step #1
Post by: jj2007 on November 11, 2021, 08:28:03 PM

Quote from: raymond on November 11, 2021, 03:18:06 PM
I would think that using Look-up tables for converting numbers from ascii format to binary may be the fastest option. I don't have the facilities to compare various procedures but someone else may want to test it.

Thank you, Raymond. Here it is as ConvertLUT:

Code Select

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

475     cycles for 100 * C2D_J
1597    cycles for 100 * atodw
22224   cycles for 100 * sscanf
772     cycles for 100 * min2cvt
555     cycles for 100 * min1cvt
640     cycles for 100 * ConvertLUT

497     cycles for 100 * C2D_J
1595    cycles for 100 * atodw
22401   cycles for 100 * sscanf
800     cycles for 100 * min2cvt
584     cycles for 100 * min1cvt
631     cycles for 100 * ConvertLUT

489     cycles for 100 * C2D_J
1598    cycles for 100 * atodw
22796   cycles for 100 * sscanf
776     cycles for 100 * min2cvt
551     cycles for 100 * min1cvt
640     cycles for 100 * ConvertLUT

50      bytes for C2D_J
10      bytes for atodw
22      bytes for sscanf
98      bytes for min2cvt
62      bytes for min1cvt
118     bytes for ConvertLUT

1234    = eax C2D_J
1234    = eax atodw
1234    = eax sscanf
1234    = eax min2cvt
1234    = eax min1cvt
1234    = eax ConvertLUT

Title: Re: ASM for FUN NEW step #1
Post by: mineiro on November 11, 2021, 09:44:28 PM

Good job, sir raymond and sir TimoVJL;
I created a static, aligned LUT for your procedures in the data section.
I pass the data pointer on the stack, so I made a few modifications to both codes to deal with this.

raymond; when I replaced bswap in your code with shr (plus the respective changes), it performed a bit better on my machine.
I'm leaving the original code in the testcase; I only changed the ebx register to ecx so that the register preservation rules are followed without needing push/pop ebx.

TimoVJL; your code suffered in performance because the value was being passed on the stack instead of a pointer to the data. I know this can be done inside our code, but I wanted to follow the rules. I have tried to give all procedures the same overhead so that the results are honest.
After "and 0f0f0f0fh" I store the value back on the stack. I confess it's not the best attempt, because the data is in a register, goes back to the stack, and the digits are then read from the stack each time. If you know better modifications, please tell.

I have tried not to touch your code, only to adapt it to Windows rules.

sir jj2007; in your testcase some procedures get the data pointer from a register while others receive it on the stack. This creates different overheads. The first case needs only "mov reg,[reg]" while the other needs "mov reg,[esp+4]" followed by "mov reg,[reg]", and at the end the first kind of procedure does only "retn" while the other does "ret 4". Suggestions?

Code Select


wine FourCharsToDword.exe 
Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz (SSE4)

562	cycles for 100 * C2D_J
1482	cycles for 100 * atodw
12649	cycles for 100 * sscanf
752	cycles for 100 * min2cvt
477	cycles for 100 * min1cvt
471	cycles for 100 * ray_LUT
741	cycles for 100 * TimoVJL_PellesC_LUT

638	cycles for 100 * C2D_J
1480	cycles for 100 * atodw
12835	cycles for 100 * sscanf
756	cycles for 100 * min2cvt
433	cycles for 100 * min1cvt
476	cycles for 100 * ray_LUT
744	cycles for 100 * TimoVJL_PellesC_LUT

564	cycles for 100 * C2D_J
1505	cycles for 100 * atodw
12630	cycles for 100 * sscanf
778	cycles for 100 * min2cvt
427	cycles for 100 * min1cvt
470	cycles for 100 * ray_LUT
729	cycles for 100 * TimoVJL_PellesC_LUT

561	cycles for 100 * C2D_J
1478	cycles for 100 * atodw
12558	cycles for 100 * sscanf
751	cycles for 100 * min2cvt
432	cycles for 100 * min1cvt
465	cycles for 100 * ray_LUT
730	cycles for 100 * TimoVJL_PellesC_LUT

54	bytes for C2D_J
10	bytes for atodw
22	bytes for sscanf
84	bytes for min2cvt
52	bytes for min1cvt
62	bytes for ray_LUT
70	bytes for TimoVJL_PellesC_LUT

1234	= eax C2D_J
1234	= eax atodw
1234	= eax sscanf
1234	= eax min2cvt
1234	= eax min1cvt
1234	= eax ray_LUT
1234	= eax TimoVJL_PellesC_LUT

PS: sir jj2007 told me by PM that my executables do not run from the GUI, so you must open a console to run them; or, if someone can build the executable directly on Windows, just post that. I'm doing these things on Linux. Appreciated.

Title: Re: ASM for FUN NEW step #1
Post by: mineiro on November 11, 2021, 09:57:43 PM

These are results from jj2007 testcase:

Code Select


wine FourCharsToDword.exe 
Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz (SSE4)

479	cycles for 100 * C2D_J
1528	cycles for 100 * atodw
12745	cycles for 100 * sscanf
725	cycles for 100 * min2cvt
493	cycles for 100 * min1cvt
502	cycles for 100 * ConvertLUT

478	cycles for 100 * C2D_J
1610	cycles for 100 * atodw
12623	cycles for 100 * sscanf
745	cycles for 100 * min2cvt
495	cycles for 100 * min1cvt
501	cycles for 100 * ConvertLUT

550	cycles for 100 * C2D_J
1483	cycles for 100 * atodw
12590	cycles for 100 * sscanf
741	cycles for 100 * min2cvt
581	cycles for 100 * min1cvt
500	cycles for 100 * ConvertLUT

50	bytes for C2D_J
10	bytes for atodw
22	bytes for sscanf
98	bytes for min2cvt
62	bytes for min1cvt
118	bytes for ConvertLUT

1234	= eax C2D_J
1234	= eax atodw
1234	= eax sscanf
1234	= eax min2cvt
1234	= eax min1cvt
1234	= eax ConvertLUT

Title: Re: ASM for FUN NEW step #1
Post by: mineiro on November 11, 2021, 11:34:15 PM

Thanks for the tip sir jj2007;
I checked the command line used when building the executable, and subsystem:WINDOWS was the default; I changed that to CONSOLE.
The link version that I was using was v8.???; I changed that to Version 5.12.8078.
So I suppose it now works by double-clicking.

Title: Re: ASM for FUN NEW step #1
Post by: jj2007 on November 12, 2021, 01:03:01 AM

Thanks, mineiro - it works, and now we have a clear winner, at least on my cpu:

Code Select

min1cvt proc _data:dword
option prologue:none
option epilogue:none

mov ecx,[esp+4]
mov eax,[ecx]
xor eax,"0000"      ;and eax,0f0f0f0fh
movd xmm0,eax       ;01020304h
punpcklbw xmm0,xmm2 ;0001000200030004h
pmaddwd xmm0,xmm1   ;mul and add
;---------------------------
				;db 66h,0fh,38h,02h,0c0h	;          phaddd xmm0,xmm0	;SSE3
				;movd eax,xmm0

movd eax,xmm0
psrlq xmm0,32
movd ecx,xmm0
add eax,ecx
ret 4
min1cvt endp

Code Select

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

655     cycles for 100 * C2D_J
1605    cycles for 100 * atodw
22178   cycles for 100 * sscanf
785     cycles for 100 * min2cvt
475     cycles for 100 * min1cvt
599     cycles for 100 * ray_LUT
661     cycles for 100 * TimoVJL_PellesC_LUT

651     cycles for 100 * C2D_J
1599    cycles for 100 * atodw
22808   cycles for 100 * sscanf
823     cycles for 100 * min2cvt
508     cycles for 100 * min1cvt
607     cycles for 100 * ray_LUT
717     cycles for 100 * TimoVJL_PellesC_LUT

714     cycles for 100 * C2D_J
1615    cycles for 100 * atodw
22657   cycles for 100 * sscanf
778     cycles for 100 * min2cvt
488     cycles for 100 * min1cvt
587     cycles for 100 * ray_LUT
657     cycles for 100 * TimoVJL_PellesC_LUT

696     cycles for 100 * C2D_J
1613    cycles for 100 * atodw
22193   cycles for 100 * sscanf
778     cycles for 100 * min2cvt
487     cycles for 100 * min1cvt
595     cycles for 100 * ray_LUT
664     cycles for 100 * TimoVJL_PellesC_LUT

54      bytes for C2D_J
10      bytes for atodw
22      bytes for sscanf
84      bytes for min2cvt
52      bytes for min1cvt
62      bytes for ray_LUT
70      bytes for TimoVJL_PellesC_LUT

1234    = eax C2D_J
1234    = eax atodw
1234    = eax sscanf
1234    = eax min2cvt
1234    = eax min1cvt
1234    = eax ray_LUT
1234    = eax TimoVJL_PellesC_LUT

It's a pity that this is limited to the exotic test case "four digits" :sad:

Title: Re: ASM for FUN NEW step #1
Post by: daydreamer on November 12, 2021, 02:00:42 AM

Quote from: jj2007 on November 12, 2021, 01:03:01 AM
It's a pity that this is limited to the exotic test case "four digits" :sad:

It's not. Thanks, everyone, for testing different approaches to fast ascii->integer :thumbsup:
There are still the fastest ascii->double and the fastest ascii->64bit integer to look at.

Title: Re: ASM for FUN NEW step #1
Post by: mineiro on November 12, 2021, 06:13:25 AM

Quote from: jj2007 on November 12, 2021, 01:03:01 AM
Thanks, mineiro - it works, and now we have a clear winner, at least on my cpu:

Not really, I am absolutely sure that this code can be beaten by others, and that it will perform differently on other CPUs. This is what I learned on this board.

I forgot how to activate xmm, do you remember? That code is being translated to mmx.
I'm adding a "66h byte" as a prefix to the instructions in other tests to work with xmm, but I remember that there is another way to do this. Maybe macros!?
db 66h
movd xmm0,eax ;01020304h
db 66h
punpcklbw xmm0,xmm2 ;0001000200030004h
...

Title: Re: ASM for FUN NEW step #1
Post by: daydreamer on November 12, 2021, 06:54:11 AM

Quote from: mineiro on November 12, 2021, 06:13:25 AM
Quote from: jj2007 on November 12, 2021, 01:03:01 AM
Thanks, mineiro - it works, and now we have a clear winner, at least on my cpu:
Not really, I am absolutely sure that this code can be beaten by others, and that it will perform differently on other CPUs. This is what I learned on this board.

I forgot how to activate xmm, do you remember? That code is being translated to mmx.
I'm adding a "66h byte" as a prefix to the instructions in other tests to work with xmm, but I remember that there is another way to do this. Maybe macros!?
db 66h
movd xmm0,eax ;01020304h
db 66h
punpcklbw xmm0,xmm2 ;0001000200030004h
...

Changing to a newer assembler than ml 6.14 solves it without the need for macros.

Title: Re: ASM for FUN NEW step #1
Post by: mineiro on November 12, 2021, 08:08:40 AM

Quote from: daydreamer on November 12, 2021, 06:54:11 AM
Changing to a newer assembler than ml 6.14 solves it without the need for macros.

Thanks sir daydreamer.

Title: Re: ASM for FUN NEW step #1
Post by: frktons on November 12, 2021, 09:28:27 AM

That's great Guys.

I am happy to see many options are coming out and the speed of the code is quite nice. :thumbsup:

Compliments to everyone participating in the game. :eusa_clap:

Title: Re: ASM for FUN NEW step #1
Post by: hutch-- on November 12, 2021, 11:03:09 AM

Here is another approach, a finite state machine, AKA look up tree. 4 character strings fed in, integers fed out. It's big but it's reasonably fast. Timings below are with my old i7.

Output

2
9999
5432
8619
9876
2345
1234
7776
8193
4352
test return values - done

timing 10 million matches next, press any key ....

ms = 109 duration
Press any key to continue ...

Title: Re: ASM for FUN NEW step #1
Post by: jj2007 on November 12, 2021, 11:45:23 AM

Finite state machine, AKA look up tree vs my algo. I hope I didn't make any mistakes in translation. The attachment does not contain the num.asm file - see Hutch' post.

Title: Re: ASM for FUN NEW step #1
Post by: hutch-- on November 12, 2021, 12:31:28 PM

:thumbsup:

That is genuinely fast and it looks like it's producing the right output numbers.

Title: Re: ASM for FUN NEW step #1
Post by: jj2007 on November 12, 2021, 01:01:42 PM

I managed to integrate Hutch' finite state machine into the testbed. To assemble it, the num.asm file must be in the same folder. It's still plain Masm32 SDK, MasmBasic is not required. The min1cvt algo is a clear winner :thumbsup:

Code Select

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

665     cycles for 100 * C2D_J
1600    cycles for 100 * atodw
21983   cycles for 100 * sscanf
766     cycles for 100 * min2cvt
545     cycles for 100 * min1cvt
683     cycles for 100 * ConvertLUT
2292    cycles for 100 * FSM (Hutch)

690     cycles for 100 * C2D_J
1605    cycles for 100 * atodw
21970   cycles for 100 * sscanf
769     cycles for 100 * min2cvt
544     cycles for 100 * min1cvt
653     cycles for 100 * ConvertLUT
2326    cycles for 100 * FSM (Hutch)

660     cycles for 100 * C2D_J
1612    cycles for 100 * atodw
22241   cycles for 100 * sscanf
770     cycles for 100 * min2cvt
547     cycles for 100 * min1cvt
626     cycles for 100 * ConvertLUT
2263    cycles for 100 * FSM (Hutch)

54      bytes for C2D_J
10      bytes for atodw
22      bytes for sscanf
98      bytes for min2cvt
62      bytes for min1cvt
118     bytes for ConvertLUT
8       bytes for FSM (Hutch)

1234    = eax C2D_J
1234    = eax atodw
1234    = eax sscanf
1234    = eax min2cvt
1234    = eax min1cvt
1234    = eax ConvertLUT
1234    = eax FSM (Hutch)

Title: Re: ASM for FUN NEW step #1
Post by: hutch-- on November 12, 2021, 05:52:59 PM

The attached zip file contains the source for the FSM, it is every number between 0000 and 9999 and will test all of them if you make the function call.

It occurs in this form.

0000 0000
0001 0001
0002 0002
0003 0003
0004 0004
............
9995 9995
9996 9996
9997 9997
9998 9998
9999 9999

Pass the first string to the integers procedure and it will return the actual integer in EAX.

Title: Re: ASM for FUN NEW step #1
Post by: TimoVJL on November 12, 2021, 06:40:44 PM

Code Select

AMD Athlon(tm) II X2 220 Processor (SSE3)

1186    cycles for 100 * C2D_J
2428    cycles for 100 * atodw
35137   cycles for 100 * sscanf
1304    cycles for 100 * min2cvt
1059    cycles for 100 * min1cvt
712     cycles for 100 * ConvertLUT
2780    cycles for 100 * FSM (Hutch)

1189    cycles for 100 * C2D_J
2426    cycles for 100 * atodw
34990   cycles for 100 * sscanf
1305    cycles for 100 * min2cvt
1065    cycles for 100 * min1cvt
721     cycles for 100 * ConvertLUT
2738    cycles for 100 * FSM (Hutch)

1185    cycles for 100 * C2D_J
2427    cycles for 100 * atodw
34735   cycles for 100 * sscanf
1311    cycles for 100 * min2cvt
1058    cycles for 100 * min1cvt
710     cycles for 100 * ConvertLUT
3155    cycles for 100 * FSM (Hutch)

54      bytes for C2D_J
10      bytes for atodw
22      bytes for sscanf
98      bytes for min2cvt
62      bytes for min1cvt
118     bytes for ConvertLUT
8       bytes for FSM (Hutch)

1234    = eax C2D_J
1234    = eax atodw
1234    = eax sscanf
1234    = eax min2cvt
1234    = eax min1cvt
1234    = eax ConvertLUT
1234    = eax FSM (Hutch)

--- ok ---

Code Select

AMD Ryzen 5 3400G with Radeon Vega Graphics     (SSE4)

532     cycles for 100 * C2D_J
1401    cycles for 100 * atodw
25676   cycles for 100 * sscanf
523     cycles for 100 * min2cvt
435     cycles for 100 * min1cvt
615     cycles for 100 * ConvertLUT
2124    cycles for 100 * FSM (Hutch)

524     cycles for 100 * C2D_J
1371    cycles for 100 * atodw
26048   cycles for 100 * sscanf
524     cycles for 100 * min2cvt
431     cycles for 100 * min1cvt
618     cycles for 100 * ConvertLUT
2129    cycles for 100 * FSM (Hutch)

512     cycles for 100 * C2D_J
1383    cycles for 100 * atodw
25954   cycles for 100 * sscanf
517     cycles for 100 * min2cvt
434     cycles for 100 * min1cvt
615     cycles for 100 * ConvertLUT
2127    cycles for 100 * FSM (Hutch)

54      bytes for C2D_J
10      bytes for atodw
22      bytes for sscanf
98      bytes for min2cvt
62      bytes for min1cvt
118     bytes for ConvertLUT
8       bytes for FSM (Hutch)

1234    = eax C2D_J
1234    = eax atodw
1234    = eax sscanf
1234    = eax min2cvt
1234    = eax min1cvt
1234    = eax ConvertLUT
1234    = eax FSM (Hutch)

-

Title: Re: ASM for FUN NEW step #1
Post by: TimoVJL on November 13, 2021, 11:45:28 PM

Example how msvc C optimizer works:

Code Select

int str4int( char *s)
{
	char anum[5];
	(*(int*)&anum) = (*(int*)s) & 0x0f0f0f0f;
	return a1000[anum[0]] + a100[anum[1]] + a10[anum[2]] + anum[3];
}

Code Select

_str4int:
  [00000000] 8B442404               mov               eax,dword ptr [esp+4]
  [00000004] 53                     push              ebx
  [00000005] 8B18                   mov               ebx,dword ptr [eax]
  [00000007] 81E30F0F0F0F           and               ebx,F0F0F0Fh
  [0000000D] 8BC3                   mov               eax,ebx
  [0000000F] C1E810                 shr               eax,10h
  [00000012] 0FB6D0                 movzx             edx,al
  [00000015] 8BC3                   mov               eax,ebx
  [00000017] C1E808                 shr               eax,8
  [0000001A] 0FB6C8                 movzx             ecx,al
  [0000001D] 8B049500000000         mov               eax,dword ptr [edx*4+_a10]
  [00000024] 03048D00000000         add               eax,dword ptr [ecx*4+_a100]
  [0000002B] 0FB6CB                 movzx             ecx,bl
  [0000002E] C1EB18                 shr               ebx,18h
  [00000031] 03048D00000000         add               eax,dword ptr [ecx*4+_a1000]
  [00000038] 03C3                   add               eax,ebx
  [0000003A] 5B                     pop               ebx
  [0000003B] C3                     ret

Title: Re: ASM for FUN NEW step #1
Post by: jj2007 on November 14, 2021, 12:50:56 AM

New version, my algo and mineiro's are somewhat faster now:

Code Select

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

554     cycles for 100 * C2D_J
1594    cycles for 100 * atodw
776     cycles for 100 * min2cvt
476     cycles for 100 * min1cvt (SSE3)
625     cycles for 100 * ConvertLUT
2273    cycles for 100 * FSM (Hutch)

545     cycles for 100 * C2D_J
1575    cycles for 100 * atodw
768     cycles for 100 * min2cvt
478     cycles for 100 * min1cvt (SSE3)
625     cycles for 100 * ConvertLUT
2252    cycles for 100 * FSM (Hutch)

548     cycles for 100 * C2D_J
1594    cycles for 100 * atodw
765     cycles for 100 * min2cvt
478     cycles for 100 * min1cvt (SSE3)
637     cycles for 100 * ConvertLUT
2262    cycles for 100 * FSM (Hutch)

54      bytes for C2D_J
10      bytes for atodw
98      bytes for min2cvt
50      bytes for min1cvt (SSE3)
118     bytes for ConvertLUT
8       bytes for FSM (Hutch)

1234    = eax C2D_J
1234    = eax atodw
1234    = eax min2cvt
1234    = eax min1cvt (SSE3)
1234    = eax ConvertLUT
1234    = eax FSM (Hutch)

@Timo: I've tried the MSVC optimised version, but it's 16% slower and gives a wrong result :sad:
See below C2D_J: (or search for msvc inside the file)

Title: Re: ASM for FUN NEW step #1
Post by: mineiro on November 14, 2021, 01:23:17 AM

On my machine, "shr reg,16" performs better than "bswap" (with the respective changes that follow from this modification). The final gain is about 10~13 cycles.
ConvertLUT can be optimized a bit more by removing "push/pop ebx" and using the ecx register instead.

These are results of last benchmark test:
Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz (SSE4)

Code Select


513	cycles for 100 * C2D_J
1475	cycles for 100 * atodw
687	cycles for 100 * min2cvt
463	cycles for 100 * min1cvt (SSE3)
640	cycles for 100 * ConvertLUT
1383	cycles for 100 * FSM (Hutch)

467	cycles for 100 * C2D_J
1477	cycles for 100 * atodw
696	cycles for 100 * min2cvt
456	cycles for 100 * min1cvt (SSE3)
639	cycles for 100 * ConvertLUT
1383	cycles for 100 * FSM (Hutch)

456	cycles for 100 * C2D_J
1474	cycles for 100 * atodw
702	cycles for 100 * min2cvt
503	cycles for 100 * min1cvt (SSE3)
494	cycles for 100 * ConvertLUT
1431	cycles for 100 * FSM (Hutch)

54	bytes for C2D_J
10	bytes for atodw
98	bytes for min2cvt
50	bytes for min1cvt (SSE3)
118	bytes for ConvertLUT
8	bytes for FSM (Hutch)

1234	= eax C2D_J
1234	= eax atodw
1234	= eax min2cvt
1234	= eax min1cvt (SSE3)
1234	= eax ConvertLUT
1234	= eax FSM (Hutch)

Title: Re: ASM for FUN NEW step #1
Post by: daydreamer on November 14, 2021, 01:56:33 AM

Code Select

Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz (SSE4)

453     cycles for 100 * C2D_J
1476    cycles for 100 * atodw
745     cycles for 100 * min2cvt
483     cycles for 100 * min1cvt (SSE3)
477     cycles for 100 * ConvertLUT
1397    cycles for 100 * FSM (Hutch)

457     cycles for 100 * C2D_J
1458    cycles for 100 * atodw
695     cycles for 100 * min2cvt
461     cycles for 100 * min1cvt (SSE3)
470     cycles for 100 * ConvertLUT
1366    cycles for 100 * FSM (Hutch)

450     cycles for 100 * C2D_J
1456    cycles for 100 * atodw
692     cycles for 100 * min2cvt
471     cycles for 100 * min1cvt (SSE3)
473     cycles for 100 * ConvertLUT
1374    cycles for 100 * FSM (Hutch)

54      bytes for C2D_J
10      bytes for atodw
98      bytes for min2cvt
50      bytes for min1cvt (SSE3)
118     bytes for ConvertLUT
8       bytes for FSM (Hutch)

1234    = eax C2D_J
1234    = eax atodw
1234    = eax min2cvt
1234    = eax min1cvt (SSE3)
1234    = eax ConvertLUT
1234    = eax FSM (Hutch)

-

@mineiro, about shr reg,16 vs bswap:
with your own SSE version, you should try whether a byte shuffle is faster.
My cpu is haswell based, so it supports avx2; can anyone other than hutch test avx2 code?

Title: Re: ASM for FUN NEW step #1
Post by: TimoVJL on November 14, 2021, 08:15:40 AM

A test file built with msvc 2019; it outputs 1234 as expected.
It is just to see how C works with a LUT.

Title: Re: ASM for FUN NEW step #1
Post by: jj2007 on November 14, 2021, 11:13:07 AM

Quote from: TimoVJL on November 14, 2021, 08:15:40 AM
A test file built with msvc 2019; it outputs 1234 as expected.
It is just to see how C works with a LUT.

For me it didn't work, the result is 1114 instead of 1234. Please search for msvc inside FourCharsToDword.asm, it should be line 74; change line 56 to if 0 to test the msvc code. Maybe I made an error when copying & pasting your code :sad:

Title: Re: ASM for FUN NEW step #1
Post by: hutch-- on November 14, 2021, 12:43:24 PM

To appeal to your sense of humour, I split the FSM into 4 procedures, much smaller file, got it to produce the right numbers but it was about 30% slower than the single FSM procedure. :undecided:

Title: Re: ASM for FUN NEW step #1
Post by: jj2007 on November 14, 2021, 11:27:52 PM

Ok, here is version 5, with Timo's MSVC "optimised" algo. On my machine, mineiro's SSE3 algo is a clear winner:

Code Select

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

545     cycles for 100 * C2D_J
1587    cycles for 100 * atodw
769     cycles for 100 * min2cvt
503     cycles for 100 * min1cvt (SSE3)
626     cycles for 100 * ConvertLUT
2267    cycles for 100 * FSM (Hutch)
698     cycles for 100 * MSVC (Timo)

550     cycles for 100 * C2D_J
1593    cycles for 100 * atodw
768     cycles for 100 * min2cvt
477     cycles for 100 * min1cvt (SSE3)
624     cycles for 100 * ConvertLUT
2244    cycles for 100 * FSM (Hutch)
698     cycles for 100 * MSVC (Timo)

547     cycles for 100 * C2D_J
1589    cycles for 100 * atodw
768     cycles for 100 * min2cvt
483     cycles for 100 * min1cvt (SSE3)
624     cycles for 100 * ConvertLUT
2246    cycles for 100 * FSM (Hutch)
698     cycles for 100 * MSVC (Timo)

54      bytes for C2D_J
10      bytes for atodw
98      bytes for min2cvt
50      bytes for min1cvt (SSE3)
118     bytes for ConvertLUT
8       bytes for FSM (Hutch)
70      bytes for MSVC (Timo)

1234    = eax C2D_J
1234    = eax atodw
1234    = eax min2cvt
1234    = eax min1cvt (SSE3)
1234    = eax ConvertLUT
1234    = eax FSM (Hutch)
1234    = eax MSVC (Timo)

Title: Re: ASM for FUN NEW step #1
Post by: daydreamer on November 15, 2021, 12:23:28 AM

Mineiro's is also my favourite, but for ascii to double conversion, I think Raymond's approach would be more versatile: it takes care of "1234.5678", but also "123.45678" and "12345.678".

Title: Re: ASM for FUN NEW step #1
Post by: jj2007 on November 15, 2021, 12:46:38 AM

Quote from: daydreamer on November 15, 2021, 12:23:28 AM
for ascii to double conversion, I think Raymond approach would be more versatile take care of "1234.5678", but also "123.45678" and "12345.678"

Really? How exactly would you do that?

Code Select

ConvertLUT:
  mov edx, [eax]	; pointer to string, e.g. "1234"
  and edx, 0f0f0f0fh	; convert "1" to 1
  movzx ecx, dl		;1000s
  mov eax, t1000[ecx*4]
  movzx ecx,dh		;100s
  add eax, t100[ecx*4]
  bswap edx
  movzx ecx,dh		;10s
  add eax, t10[ecx*4]
  movzx ecx, dl		;units
  add eax, ecx
  retn

Title: Re: ASM for FUN NEW step #1
Post by: TimoVJL on November 15, 2021, 02:15:03 AM

Sadly, the test program doesn't work with the AMD Athlon(tm) II X2 220 Processor (SSE3):

Code Select

AMD Athlon(tm) II X2 220 Processor (SSE3)

909     cycles for 100 * C2D_J
2480    cycles for 100 * atodw
1301    cycles for 100 * min2cvt

Title: Re: ASM for FUN NEW step #1
Post by: FORTRANS on November 15, 2021, 03:06:24 AM

Code Select

Intel(R) Core(TM) i3-4005U CPU @ 1.70GHz (SSE4)

556	cycles for 100 * C2D_J
1699	cycles for 100 * atodw
847	cycles for 100 * min2cvt
553	cycles for 100 * min1cvt (SSE3)
588	cycles for 100 * ConvertLUT
1597	cycles for 100 * FSM (Hutch)

552	cycles for 100 * C2D_J
1701	cycles for 100 * atodw
847	cycles for 100 * min2cvt
557	cycles for 100 * min1cvt (SSE3)
586	cycles for 100 * ConvertLUT
1596	cycles for 100 * FSM (Hutch)

554	cycles for 100 * C2D_J
1701	cycles for 100 * atodw
846	cycles for 100 * min2cvt
565	cycles for 100 * min1cvt (SSE3)
588	cycles for 100 * ConvertLUT
1587	cycles for 100 * FSM (Hutch)

54	bytes for C2D_J
10	bytes for atodw
98	bytes for min2cvt
50	bytes for min1cvt (SSE3)
118	bytes for ConvertLUT
8	bytes for FSM (Hutch)

1234	= eax C2D_J
1234	= eax atodw
1234	= eax min2cvt
1234	= eax min1cvt (SSE3)
1234	= eax ConvertLUT
1234	= eax FSM (Hutch)

--- ok ---

Intel(R) Core(TM) i3-10110U CPU @ 2.10GHz (SSE4)

348	cycles for 100 * C2D_J
1080	cycles for 100 * atodw
501	cycles for 100 * min2cvt
323	cycles for 100 * min1cvt (SSE3)
359	cycles for 100 * ConvertLUT
1019	cycles for 100 * FSM (Hutch)

338	cycles for 100 * C2D_J
1051	cycles for 100 * atodw
509	cycles for 100 * min2cvt
366	cycles for 100 * min1cvt (SSE3)
385	cycles for 100 * ConvertLUT
1015	cycles for 100 * FSM (Hutch)

330	cycles for 100 * C2D_J
1119	cycles for 100 * atodw
516	cycles for 100 * min2cvt
326	cycles for 100 * min1cvt (SSE3)
376	cycles for 100 * ConvertLUT
1041	cycles for 100 * FSM (Hutch)

54	bytes for C2D_J
10	bytes for atodw
98	bytes for min2cvt
50	bytes for min1cvt (SSE3)
118	bytes for ConvertLUT
8	bytes for FSM (Hutch)

1234	= eax C2D_J
1234	= eax atodw
1234	= eax min2cvt
1234	= eax min1cvt (SSE3)
1234	= eax ConvertLUT
1234	= eax FSM (Hutch)

--- ok ---

Title: Re: ASM for FUN NEW step #1
Post by: TimoVJL on November 15, 2021, 03:15:12 AM

Code Select

AMD Ryzen 5 3400G with Radeon Vega Graphics     (SSE4)

536     cycles for 100 * C2D_J
1365    cycles for 100 * atodw
517     cycles for 100 * min2cvt
353     cycles for 100 * min1cvt (SSE3)
615     cycles for 100 * ConvertLUT
2123    cycles for 100 * FSM (Hutch)
618     cycles for 100 * MSVC (Timo)

546     cycles for 100 * C2D_J
1363    cycles for 100 * atodw
516     cycles for 100 * min2cvt
350     cycles for 100 * min1cvt (SSE3)
615     cycles for 100 * ConvertLUT
2152    cycles for 100 * FSM (Hutch)
620     cycles for 100 * MSVC (Timo)

536     cycles for 100 * C2D_J
1371    cycles for 100 * atodw
568     cycles for 100 * min2cvt
350     cycles for 100 * min1cvt (SSE3)
610     cycles for 100 * ConvertLUT
2142    cycles for 100 * FSM (Hutch)
624     cycles for 100 * MSVC (Timo)

54      bytes for C2D_J
10      bytes for atodw
98      bytes for min2cvt
50      bytes for min1cvt (SSE3)
118     bytes for ConvertLUT
8       bytes for FSM (Hutch)
70      bytes for MSVC (Timo)

1234    = eax C2D_J
1234    = eax atodw
1234    = eax min2cvt
1234    = eax min1cvt (SSE3)
1234    = eax ConvertLUT
1234    = eax FSM (Hutch)
1234    = eax MSVC (Timo)

Title: Re: ASM for FUN NEW step #1
Post by: daydreamer on November 15, 2021, 06:16:56 AM

Quote from: jj2007 on November 15, 2021, 12:46:38 AM
Quote from: daydreamer on November 15, 2021, 12:23:28 AM
for ascii to double conversion, I think Raymond approach would be more versatile take care of "1234.5678", but also "123.45678" and "12345.678"

Really? How exactly would you do that?

Code Select

ConvertLUT:
  mov edx, [eax]	; pointer to string, e.g. "1234"
  and edx, 0f0f0f0fh	; convert "1" to 1
  movzx ecx, dl		;1000s
  mov eax, t1000[ecx*4]
  movzx ecx,dh		;100s
  add eax, t100[ecx*4]
  bswap edx
  movzx ecx,dh		;10s
  add eax, t10[ecx*4]
  movzx ecx, dl		;units
  add eax, ecx
  retn

Code Select


.data
;integer part
t1000 real4 0.0,1000.0,2000.0,3000.0,4000.0,5000.0,6000.0,7000.0,8000.0,9000.0
t100 real4 0.0,100.0,200.0,300.0,400.0,500.0,600.0,700.0,800.0,900.0
t10 real4 0.0,10.0,20.0,30.0,40.0,50.0,60.0,70.0,80.0,90.0
t1 real4 0.0,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0
;decimals
t0dot1 real4 0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9
t0dot01 real4 0.0,0.01,0.02,0.03,0.04,0.05,0.06,0.07,0.08,0.09
t0dot001 real4 0.0,0.001,0.002,0.003,0.004,0.005,0.006,0.007,0.008,0.009
t0dot0001 real4 0.0,0.0001,0.0002,0.0003,0.0004,0.0005,0.0006,0.0007,0.0008,0.0009
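
One way such tables could be used is sketched below in C. This is not daydreamer's code: the function name lut_to_double is hypothetical, the tables are stored as double rather than real4 to keep rounding error small, and the layout is limited to at most four integer and four decimal digits, matching the eight tables listed above. Longer inputs such as "12345.678" would need additional rows.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Row k holds digit * weight: rows 0..3 are 1000s..units, rows 4..7 are 0.1..0.0001. */
static const double lut[8][10] = {
    { 0.0, 1000.0, 2000.0, 3000.0, 4000.0, 5000.0, 6000.0, 7000.0, 8000.0, 9000.0 },
    { 0.0, 100.0, 200.0, 300.0, 400.0, 500.0, 600.0, 700.0, 800.0, 900.0 },
    { 0.0, 10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0 },
    { 0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0 },
    { 0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9 },
    { 0.0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09 },
    { 0.0, 0.001, 0.002, 0.003, 0.004, 0.005, 0.006, 0.007, 0.008, 0.009 },
    { 0.0, 0.0001, 0.0002, 0.0003, 0.0004, 0.0005, 0.0006, 0.0007, 0.0008, 0.0009 }
};

/* Returns 1 on success, 0 if the string does not fit the 4.4 digit layout. */
static int lut_to_double(const char *s, double *out)
{
    const char *dot;
    size_t int_len, frac_len, i;
    double sum = 0.0;

    assert(s != NULL && out != NULL);
    dot = strchr(s, '.');
    int_len = dot ? (size_t)(dot - s) : strlen(s);
    frac_len = dot ? strlen(dot + 1) : 0;
    if (int_len == 0 || int_len > 4 || frac_len > 4)
        return 0;

    for (i = 0; i < int_len; i++) {            /* the decimal point position picks the row */
        unsigned d = (unsigned)(s[i] - '0');
        if (d > 9) return 0;
        sum += lut[4 - int_len + i][d];
    }
    for (i = 0; i < frac_len; i++) {
        unsigned d = (unsigned)(dot[1 + i] - '0');
        if (d > 9) return 0;
        sum += lut[4 + i][d];
    }
    *out = sum;
    return 1;
}
```

The result is a sum of rounded table entries, so it can differ from a correctly rounded conversion in the last bits. The loop form is also far from the branch-free four-digit routines timed in this thread; it only illustrates how the decimal point position selects the tables.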

Title: Re: ASM for FUN NEW step #1
Post by: jj2007 on November 15, 2021, 07:22:34 AM

Ok, these are the data, and now the code please, daydreamer :thumbsup:

Title: Re: ASM for FUN NEW step #1
Post by: nidud on November 16, 2021, 03:18:06 AM

deleted

Title: Re: ASM for FUN NEW step #1
Post by: raymond on November 16, 2021, 04:21:42 AM

Quote from: nidud on November 16, 2021, 03:18:06 AM
IMUL is actually a rather fast instruction, even when used with memory operands, so there may not be that much gain coding around it (LEA/LUT). BSWAP however is a relatively slow instruction so that should be avoided if possible.

The qeditor Help section has a listing for opcodes, along with the expected clock cycles for each. The BSWAP one has the following:

Quote
Usage: BSWAP reg32

Modifies flags: none

Changes the byte order of a 32 bit register from big endian to
little endian or vice versa. Result left in destination register
is undefined if the operand is a 16 bit register.

                   Clocks                      Size
Operands           808x     286  386    486    Bytes
reg32              -        -    -      1      2

0F C8+ rd BSWAP r32 Reverses the byte order of a 32-bit register.

If you consider BSWAP as a "relatively slow instruction", maybe Hutch should review all the tables of that Help section.

Then, in addition, you mention that "IMUL is actually a rather fast instruction", BUT the imul instruction help section in qeditor shows the following:

Quote
                   Clocks                      Size
Operands           808x     286  386    486    Bytes
reg8               80-98    13   9-14   13-18  2
reg16              128-154  21   9-22   13-26  2
reg32              -        -    9-38   12-42  2
mem8               86-104   16   12-17  13-18  2-4
mem16              134-160  24   12-25  13-26  2-4
mem32              -        -    12-41  13-42  2-4
reg16,reg16        -        -    9-22   13-26  3-5
reg32,reg32        -        -    9-38   13-42  3-5
reg16,mem16        -        -    12-25  13-26  3-5
reg32,mem32        -        -    12-41  13-42  3-5
reg16,immed        -        21   9-22   13-26  3
reg32,immed        -        21   9-38   13-42  3-6
reg16,reg16,immed  -        2    9-22   13-26  3-6
reg32,reg32,immed  -        21   9-38   13-42  3-6
reg16,mem16,immed  -        24   12-25  13-26  3-6
reg32,mem32,immed  -        24   12-41  13-42  3-6

Any explanation for this????
Could you provide us with YOUR sources for clock cycles?

Title: Re: ASM for FUN NEW step #1
Post by: jj2007 on November 16, 2021, 04:41:42 AM

Quote from: nidud on November 16, 2021, 03:18:06 AM
IMUL is actually a rather fast instruction, even when used with memory operands, so there may not be that much gain coding around it (LEA/LUT). BSWAP however is a relatively slow instruction so that should be avoided if possible.
...
86804 cycles 3.asm: pmaddwd
89518 cycles 1.asm: imul
94277 cycles 4.asm: pmaddwd+bswap
108254 cycles 2.asm: imul+bswap

Thanks, Nidud. Here are my results:

Code Select

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

546     cycles for 100 * C2D_J
793     cycles for 100 * imul (Nidud)
762     cycles for 100 * min2cvt
479     cycles for 100 * min1cvt (SSE3)
625     cycles for 100 * ConvertLUT
2243    cycles for 100 * FSM (Hutch)
709     cycles for 100 * MSVC (Timo)

554     cycles for 100 * C2D_J
793     cycles for 100 * imul (Nidud)
775     cycles for 100 * min2cvt
476     cycles for 100 * min1cvt (SSE3)
627     cycles for 100 * ConvertLUT
2245    cycles for 100 * FSM (Hutch)
696     cycles for 100 * MSVC (Timo)

546     cycles for 100 * C2D_J
792     cycles for 100 * imul (Nidud)
766     cycles for 100 * min2cvt
475     cycles for 100 * min1cvt (SSE3)
643     cycles for 100 * ConvertLUT
2252    cycles for 100 * FSM (Hutch)
703     cycles for 100 * MSVC (Timo)

54      bytes for C2D_J
58      bytes for imul (Nidud)
98      bytes for min2cvt
50      bytes for min1cvt (SSE3)
122     bytes for ConvertLUT
8       bytes for FSM (Hutch)
70      bytes for MSVC (Timo)

1234    = eax C2D_J
1234    = eax atodw
1234    = eax imul (Nidud)
1234    = eax min2cvt
1234    = eax min1cvt (SSE3)
1234    = eax ConvertLUT
1234    = eax FSM (Hutch)
1234    = eax MSVC (Timo)

Quote from: raymond on November 16, 2021, 04:21:42 AM
Could you provide us with YOUR sources for clock cycles?

I don't know what Nidud's sources are, but here is mine - Agner Fog: one cycle for bswap (https://www.agner.org/optimize/instruction_tables.pdf)

Title: Re: ASM for FUN NEW step #1
Post by: TimoVJL on November 16, 2021, 05:14:19 AM

Code Select

AMD Ryzen 5 3400G with Radeon Vega Graphics     (SSE4)

537     cycles for 100 * C2D_J
723     cycles for 100 * imul (Nidud)
571     cycles for 100 * min2cvt
339     cycles for 100 * min1cvt (SSE3)
521     cycles for 100 * ConvertLUT
2133    cycles for 100 * FSM (Hutch)
630     cycles for 100 * MSVC (Timo)

542     cycles for 100 * C2D_J
632     cycles for 100 * imul (Nidud)
510     cycles for 100 * min2cvt
340     cycles for 100 * min1cvt (SSE3)
510     cycles for 100 * ConvertLUT
2134    cycles for 100 * FSM (Hutch)
621     cycles for 100 * MSVC (Timo)

537     cycles for 100 * C2D_J
639     cycles for 100 * imul (Nidud)
511     cycles for 100 * min2cvt
341     cycles for 100 * min1cvt (SSE3)
510     cycles for 100 * ConvertLUT
2131    cycles for 100 * FSM (Hutch)
619     cycles for 100 * MSVC (Timo)

54      bytes for C2D_J
58      bytes for imul (Nidud)
98      bytes for min2cvt
50      bytes for min1cvt (SSE3)
122     bytes for ConvertLUT
8       bytes for FSM (Hutch)
70      bytes for MSVC (Timo)

1234    = eax C2D_J
1234    = eax atodw
1234    = eax imul (Nidud)
1234    = eax min2cvt
1234    = eax min1cvt (SSE3)
1234    = eax ConvertLUT
1234    = eax FSM (Hutch)
1234    = eax MSVC (Timo)

Title: Re: ASM for FUN NEW step #1
Post by: nidud on November 16, 2021, 05:39:56 AM

deleted

Title: Re: ASM for FUN NEW step #1
Post by: nidud on November 16, 2021, 06:33:20 AM

deleted

Title: Re: ASM for FUN NEW step #1
Post by: daydreamer on November 16, 2021, 07:45:15 AM

I wonder how fast an AVX2 version would be?

Title: Re: ASM for FUN NEW step #1
Post by: jj2007 on November 16, 2021, 08:26:41 AM

Quote from: daydreamer on November 16, 2021, 07:45:15 AM
I wonder how fast an AVX2 version would be?

We are still waiting for your code in reply #61, daydreamer.

Title: Re: ASM for FUN NEW step #1
Post by: nidud on November 16, 2021, 08:47:36 AM

deleted

Title: Re: ASM for FUN NEW step #1
Post by: nidud on November 16, 2021, 09:21:23 AM

deleted

Title: Re: ASM for FUN NEW step #1
Post by: Siekmanski on November 16, 2021, 09:52:41 AM

Code Select

AMD Ryzen 9 5950X 16-Core Processor             (SSE4)

343     cycles for 100 * C2D_J
660     cycles for 100 * imul (Nidud)
449     cycles for 100 * min2cvt
346     cycles for 100 * min1cvt (SSE3)
424     cycles for 100 * ConvertLUT
984     cycles for 100 * FSM (Hutch)
350     cycles for 100 * MSVC (Timo)

343     cycles for 100 * C2D_J
659     cycles for 100 * imul (Nidud)
449     cycles for 100 * min2cvt
347     cycles for 100 * min1cvt (SSE3)
424     cycles for 100 * ConvertLUT
970     cycles for 100 * FSM (Hutch)
546     cycles for 100 * MSVC (Timo)

341     cycles for 100 * C2D_J
662     cycles for 100 * imul (Nidud)
449     cycles for 100 * min2cvt
354     cycles for 100 * min1cvt (SSE3)
412     cycles for 100 * ConvertLUT
969     cycles for 100 * FSM (Hutch)
347     cycles for 100 * MSVC (Timo)

54      bytes for C2D_J
58      bytes for imul (Nidud)
98      bytes for min2cvt
50      bytes for min1cvt (SSE3)
122     bytes for ConvertLUT
8       bytes for FSM (Hutch)
70      bytes for MSVC (Timo)

1234    = eax C2D_J
1234    = eax atodw
1234    = eax imul (Nidud)
1234    = eax min2cvt
1234    = eax min1cvt (SSE3)
1234    = eax ConvertLUT
1234    = eax FSM (Hutch)
1234    = eax MSVC (Timo)

-

Title: Re: ASM for FUN NEW step #1
Post by: Greenhorn on November 16, 2021, 10:27:52 AM

Code Select

AMD Ryzen 7 3700X 8-Core Processor              (SSE4)

362	cycles for 100 * C2D_J
496	cycles for 100 * imul (Nidud)
365	cycles for 100 * min2cvt
282	cycles for 100 * min1cvt (SSE3)
438	cycles for 100 * ConvertLUT
1093	cycles for 100 * FSM (Hutch)
304	cycles for 100 * MSVC (Timo)

362	cycles for 100 * C2D_J
494	cycles for 100 * imul (Nidud)
365	cycles for 100 * min2cvt
277	cycles for 100 * min1cvt (SSE3)
439	cycles for 100 * ConvertLUT
1092	cycles for 100 * FSM (Hutch)
292	cycles for 100 * MSVC (Timo)

374	cycles for 100 * C2D_J
495	cycles for 100 * imul (Nidud)
364	cycles for 100 * min2cvt
279	cycles for 100 * min1cvt (SSE3)
437	cycles for 100 * ConvertLUT
1100	cycles for 100 * FSM (Hutch)
292	cycles for 100 * MSVC (Timo)

54	bytes for C2D_J
58	bytes for imul (Nidud)
98	bytes for min2cvt
50	bytes for min1cvt (SSE3)
122	bytes for ConvertLUT
8	bytes for FSM (Hutch)
70	bytes for MSVC (Timo)

1234	= eax C2D_J
1234	= eax atodw
1234	= eax imul (Nidud)
1234	= eax min2cvt
1234	= eax min1cvt (SSE3)
1234	= eax ConvertLUT
1234	= eax FSM (Hutch)
1234	= eax MSVC (Timo)

--- ok ---

Title: Re: ASM for FUN NEW step #1
Post by: jj2007 on November 16, 2021, 10:38:23 AM

I opened a new thread called "Timings for bswap, ror, imul, push+pop vs mov [esp+x], nnn, lodsd vs mov eax,.. (http://masm32.com/board/index.php?topic=9638.0)" in the Lab :cool:

Code Select

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

23      cycles for 100 * imul 10
54      cycles for 100 * lea: *10
4871    cycles for 100 * lodsd (25 DWORDs)
4863    cycles for 100 * mov eax, [esi] + add esi, 4
58      cycles for 100 * lea10, add eax
57      cycles for 100 * lea10, shl eax, 1
27      cycles for 100 * bswap
62      cycles for 100 * ror 16
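
The "imul 10" and "lea: *10" rows compare two ways of multiplying the running total by ten inside a digit loop. As a rough C sketch of the arithmetic only (the exact instruction sequences behind the labels above are an assumption on my part, and an optimising compiler may well emit the same code for both functions), with mul10_imul, mul10_lea and dec_to_u32 being illustrative names:

```c
#include <stdint.h>

/* x * 10 as a single multiply, e.g. imul eax, eax, 10 */
static uint32_t mul10_imul(uint32_t x)
{
    return x * 10u;
}

/* x * 10 without a multiply: lea eax, [eax+eax*4] gives x*5,
   then add eax, eax (or shl eax, 1) doubles it. */
static uint32_t mul10_lea(uint32_t x)
{
    uint32_t t = x + x * 4u;   /* x * 5  */
    return t + t;              /* x * 10 */
}

/* Unsigned decimal conversion built on the lea-style step.
   Stops at the first non-digit; no overflow or sign handling. */
static uint32_t dec_to_u32(const char *s)
{
    uint32_t acc = 0;
    while (*s >= '0' && *s <= '9')
        acc = mul10_lea(acc) + (uint32_t)(*s++ - '0');
    return acc;
}
```

Both helpers return the same value for every 32-bit input (including wrap-around), so the choice between them is purely a question of timing on a given CPU.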
