I'd like to fill an array of 1000 dword with the string values ranging
from ' 0' to ' 999'.
To avoid 400k increment in program size and writing manually 1000
values in .data, I prefer to declare the array in .data? and fill the
array with a proc: call FillArray.
I've some ideas on how to do that, but before starting the tests I'd like
your suggestions, code, already done experiments.. to think upon.
Let me know.
Frank
;-----------------------------------------------------------------------
; FillArray - fills MyArray (1000 dwords) with the binary values 0..999.
; In:    none (MyArray is a module-level array; should be 4-aligned)
; Out:   MyArray[i] = i for i = 0..999
; Clobb: eax, ecx, flags (ebx/esi/edi preserved via USES)
; NOTE(review): stores binary integers, NOT the ASCII strings
;   ' 0'..' 999' requested in the opening post.
;-----------------------------------------------------------------------
FillArray PROC USES EBX ESI EDI
        mov     edi, offset MyArray     ; destination pointer
        xor     eax, eax                ; current value, starts at 0
        mov     ecx, 1000               ; one element per iteration
FillNext:
        mov     [edi], eax              ; MyArray[i] = i
        add     edi, 4                  ; advance to next dword slot
        add     eax, 1                  ; next value
        dec     ecx
        jnz     FillNext
        ret
FillArray ENDP
Quote from: dedndave on December 09, 2012, 03:04:46 AM
FillArray PROC USES EBX ESI EDI
mov edi,offset MyArray ;or whatever you name it - it should be 4-aligned
xor eax,eax
mov ebx,1
mov edx,2
mov esi,3
mov ecx,1000/4
FArry0: mov [edi],eax
mov [edi+4],ebx
mov [edi+8],edx
mov [edi+12],esi
add eax,4
add ebx,4
add edx,4
add esi,4
add edi,16
sub ecx,1
jnz FArry0
ret
FillArray ENDP
Hi Dave, I'm not sure this routine matches my need.
You sure this routine fills the array with strings 4
bytes long containing the ascii representation of
numbers 0-999 with leading spaces?
ohhhhhhhhhhhhhhh
no - i am sure that it does not - lol
it fills the array with the binary values
give me a minute.....
EDIT: it is run once at the beginning of the program
so, i would not think speed is super-critical, anyways
;-----------------------------------------------------------------------
; FillArray - converts 0..999 to decimal ASCII via the C runtime,
;             writing one string into each 4-byte slot of MyArray.
; Clobbers the CRT-volatile registers; EBX/ESI/EDI preserved via USES.
; NOTE(review): crt__ultoa emits left-justified, NUL-terminated text,
;   so values below 100 are NOT right-aligned with leading spaces as
;   the ' 0'..' 999' spec in the opening post requires - the thread
;   itself acknowledges this version does not match the requested
;   format. TODO confirm intended output format before reuse.
;-----------------------------------------------------------------------
FillArray PROC USES EBX ESI EDI
mov edi,offset MyArray ;or whatever you name it
xor esi,esi ;esi = current value (0..999)
mov ebx,1000 ;ebx = remaining element count
FArry0: INVOKE crt__ultoa,esi,edi,10 ;decimal text of esi written at [edi]
add edi,4 ;advance to next 4-byte slot
inc esi ;next value
dec ebx
jnz FArry0
ret
FillArray ENDP
Before you design an algo, you first need to specify:
(i) if the ascii representations will be left-aligned or right-aligned within its 4-byte space when you print that 4000-byte string,
(ii) if right-aligned, should it have leading 0's or leading spaces for the numbers below 100,
(iii)which character do you need as the 4th character since each number needs at most 3 characters in ascii format.
Quote from: raymond on December 09, 2012, 03:31:05 AM
Before you design an algo, you first need to specify:
(i) if the ascii representations will be left-aligned or right-aligned within its 4-byte space when you print that 4000-byte string,
(ii) if right-aligned, should it have leading 0's or leading spaces for the numbers below 100,
(iii)which character do you need as the 4th character since each number needs at most 3 characters in ascii format.
If you read the first post, all these questions will be answered.
All the answers lies in " 0" and " 999". :P
Dave: the use of
crt__ultoa makes me doubt it will be
the fastest around.
In a few days we'll see. :t
Seems there are not too many solutions to be fast.
put 30h in a register = 0 >>> in memory
cmp 3ah
inc = 1 >>> //
..
inc 3Ah = 3130h = 10 >>> in memory
inc
cmp ...
Quote from: ToutEnMasm on December 09, 2012, 03:52:45 AM
Seems there is not to many soluces to be fast.
put 30h in a register = 0 >>> in memory
cmp 3ah
inc = 1 >>> //
..
inc 3Ah = 3130h = 10 >>> in memory
inc
cmp ...
This is a good solution, but I don't think it is the fastest around.
By the way if you want to post your test results we can compare
them with other solutions.
i didn't test it - and it may have a bug or 2 to work out
but, you get the idea
;-----------------------------------------------------------------------
; FillArray - builds the " 0".." 999" table by keeping the ASCII image
;             of the current string in EAX and incrementing it in place.
; Initial EAX = 00302020h, stored little-endian as bytes 20,20,30,00,
; i.e. "  0" plus a NUL terminator.  Byte 2 (bits 16-23) is the units
; digit; AH appears to hold the tens position and AL the hundreds.
; NOTE(review): posted explicitly untested by its author ("it may have
;   a bug or 2 to work out").  The carry paths (FArry0/FArry1/FArry2)
;   and the termination compare against 393030h have NOT been verified
;   here either - treat this block as a sketch, not working code.
;-----------------------------------------------------------------------
FillArray PROC
mov edx,offset MyArray ;or whatever you name it - it should be 4-aligned
mov eax,302020h ;EAX = " 0",0
jmp short FArry4
FArry0: and eax,030FFFFh ;units wrapped: mask back - TODO confirm mask intent
cmp ah,39h ;tens position at '9'?
jz FArry1
cmp ah,20h ;tens position still a space?
jz FArry2
add ah,1 ;ordinary tens increment
jmp short FArry4
FArry1: sub eax,8FFh ;tens '9'->'0' with carry into hundreds - TODO verify
cmp al,30h
ja FArry4
mov al,31h ;hundreds enters play as '1'
jmp short FArry4
FArry2: mov ah,31h ;first tens digit: space -> '1'
jmp short FArry4
FArry3: add eax,10000h ;fast path: bump units digit (byte 2)
FArry4: mov [edx],eax ;store current 4-byte string
add edx,4
cmp eax,393030h ;loop control / carry detection - TODO verify bound
jb FArry3
ja FArry0
ret
FillArray ENDP
Better if you do test it and post the results. We are in the lab here,
and, by the way, you should use:
mov edx,offset MyArray
mov eax,30202020h ;EAX = " 0"
instead of:
mov edx,offset MyArray ;or whatever you name it - it should be 4-aligned
mov eax,302020h ;EAX = " 0",0
to meet the requisites of the test. :P
Quote from: frktons on December 09, 2012, 04:26:58 AM
Better if you do test it and post the results. We are in the lab here...
better for who ?
i got you started - you can't get it from there ??? :lol:
Quote from: frktons on December 09, 2012, 04:26:58 AM
...and, by the way, you should use:
mov edx,offset MyArray
mov eax,30202020h ;EAX = " 0"
instead of:
mov edx,offset MyArray ;or whatever you name it - it should be 4-aligned
mov eax,302020h ;EAX = " 0",0
to meet the requisites of the test. :P
no - we want to start with 2 spaces, an ASCII 0, and a null terminator
so the high byte of EAX should be binary 0
i will give you another idea, though
it might be slightly faster to start at the end of the array with "999",0 and work down
Quote
This is a good solution, but I don't think it is the fastest around.
By the way if you want to post your test results we can compare
them with other solutions.
I am not interested in winning some µs in a prog which is just used one time.
it was just a pseudo-code sample.
Folks,
If it is supposed to be fast, it must be a solution with dword-packed xmm regs.
Load them with
[ 0 1 2 3]
first, then add
[ 4 4 4 4]
to get
[ 4 5 6 7]
Go ahead, Frank!
Quote from: ToutEnMasm on December 09, 2012, 04:40:56 AM
Quote from: dedndave on December 09, 2012, 04:36:46 AM
i got you started - you can't get it from there ??? :lol:
I'm not able to correct your code, better if you do it yourself. :eusa_snooty:
Quote
no - we want to start with 2 spaces, an ASCII 0, and a null terminator
so the high byte of EAX should be binary 0
Well that could be another test, now the test is what I stated in the first post. :P
Quote
I am not interested in wining some µs in a prog who is just used one time.
it was just a pseudo-code sample.
In my intentions this will be part of a standard proc/routine I'm thinking about.
If you are not interested in the matter, it doesn't matter, you are welcome the
same. :t
Quote from: jj2007 on December 09, 2012, 04:44:52 AM
Folks,
If it is supposed to be fast, it must be a solution with dword-packed xmm regs.
Load them with
[ 0 1 2 3]
first, than add
[ 4 4 4 4]
to get
[ 4 5 6 7]
Go ahead, Frank!
This is what I was thinking about. It is not yet clear in my mind the sequence to
make it the fastest [how many xmm registers to use, which instructions...] but that
is the idea for a fast filling. :t
If you want them in ascending sequence, it should be [ 3 2 1 0] since the xmm reg is loaded/stored in little endian order. Then you need to take care of carries from 9 to 0 with the next character incremented. This could get troublesome.
What I did was to set a string up to "0000", pick it up and store it, set a pointer to the lowest character, then increment the character pointed to by the pointer. Check if the character exceeds '9', if not pickup and save the string. If greater then '9' add (256-10) to the WORD which has the current pointer as its lowest BYTE which will set the current character to '0' (256-10+'9'+1 = '0' plus a carry to the upper BYTE) and increment the next higher character, then change the pointer to check that character for exceeding '9', etc, etc. After finding a valid character, always drop the pointer back to the lowest character and pick up the value and save it and increment the character at the pointer.
Dave
Quote from: KeepingRealBusy on December 09, 2012, 05:05:11 AM
If you want them in ascending sequence, it should be [ 3 2 1 0] since the xmm reg is loaded/stored in little endian order. Then you need to take of carries frpm 9 to 0 with the next character incremented. This could get troublesome.
What I did was to set a string up to "0000", pick it up and store it, set a pointer to the lowest character, then increment the character pointed to by the pointer. Check if the character exceeds '9', if not pickup and save the string. If greater then '9' add (256-10) to the WORD which has the current pointer as its lowest BYTE which will set the current character to '0' (256-10+'9'+1 = '0' plus a carry to the upper BYTE) and increment the next higher character, then change the pointer to check that character for exceeding '9', etc, etc. After finding a valid character, always drop the pointer back to the lowest character and pick up the value and save it and increment the character at the pointer.
Dave
The order of bytes you suggested is correct, Dave.
It should not be that difficult, at least as I foresee it, I haven't yet
implemented it, but I only imagined it. Jochen's suggestion is
probably the fastest path, but we are still in the preliminary steps,
everything should be tested before seeing the results.
Your idea seems to use mem vars, and this could slow down the
process.
If you test it, let me know your results. :t
You could use regs and different offsets in the regs for the testing and correcting. I was using an 8 byte value for my test.
Dave.
A couple of years ago I wrote a prototype in PBCC and after I translated it
in asm for a program that displays in a console screen all the ASCII chars
from 1 to 255, each one of them preceded by the ASCII number of the char.
I used 3 BYTE variables to represent the hundreds, tens and units, and filled
the console buffer with them.
Now I'd like to have a much faster routine, so instead of variables I'm going
to use GPRS and XMM registers. I need some time to write and test the code.
In the meanwhile if somebody tries any solution, is welcome to post the
solution and the performance test results. ;)
Quote from: frktons on December 09, 2012, 02:49:23 AM
I'd like to fill an array of 1000 dword with the string values ranging
from ' 0' to ' 999'.
To avoid 400k increment in program size and writing manually 1000
values in .data, I prefer to declare the array in .data? and fill the
array with a proc: call FillArray.
I've some ideas on how to to that, but before starting the tests I'd like
your suggestions, code, already done experiments.. to think upon.
Let me know.
Frank
Rather than attempting to contain a large array inside your program it would be better to allocate memory from a heap at run time. It's the same thing... but it doesn't bloat the daylights out of your program image.
tater,
1000 dwords is quite small, in terms of memory allocation
when you put something in the uninitialized data section, the size of the exe does not increase
only the image at run-time
if you allocate memory or put it in uninitialized data, it uses memory either way
the advantage to allocation is that it may be free'd up
so - if the table is to be used throughout program execution, that advantage is lost
Quote from: dedndave on December 09, 2012, 06:54:28 AM
tater,
1000 dwords is quite small, in terms of memory allocation
when you put something in the uninitialized data section, the size of the exe does not increase
only the image at run-time
if you allocate memory or put it in uninitialized data, it uses memory either way
the advantage to allocation is that it may be free'd up
so - if the table is to be used throughout program execution, that advantage is lost
Agreed, the idea is to use the array throughout program execution.
Quote from: dedndave on December 09, 2012, 06:54:28 AM
when you put something in the uninitialized data section, the size of the exe does not increase
only the image at run-time
Test it:
; Minimal MASM32 demo: reserves 280000 uninitialized dwords in .data?
; (BSS-style section - does not grow the EXE on disk) and waits for a
; keystroke.  Posted to demonstrate the "well-known MASM bug": very
; slow assembly of large .data? arrays, at least with ML 6.x -
; TODO confirm which ML versions are affected.
include \masm32\include\masm32rt.inc
.data?
MyArray dd 280000 dup(?) ;~1.1 MB reserved, zero at load, zero file cost
.code
start: inkey "Hello World" ;masm32rt macro: print text, wait for a key
exit
end start
... and get to know a well-known MASM bug. Even ML 10 still shows that odd behaviour...
yah - i know that bug, Jochen - lol
but, we are only allocating 1000 dwords :biggrin:
Quote from: jj2007 on December 09, 2012, 07:14:20 AM
Test it:
include \masm32\include\masm32rt.inc
.data?
MyArray dd 280000 dup(?)
.code
start: inkey "Hello World"
exit
end start
... and get to know a well-known MASM bug. Even ML 10 still shows that odd behaviour...
I didn't get any strange behaviour, and the exe size is 3072 bytes.
when the pgm runs it displays
Hello World and waits for the
keystroke. :icon_rolleyes:
N.B. I use ML10.
the strange behaviour is that it takes long to assemble - at least using ML v6.x it does
here you go, Frank
i get about 2.3 cycles per string on my P4 prescott
Quote from: dedndave on December 09, 2012, 08:41:05 AM
the strange behaviour is that it takes long to assemble - at least using ML v6.x it does
here you go, Frank
i get about 2.3 cycles per string on my P4 prescott
Good starting point, Dave. Let's see the next algos to decide the fastest around.
On my pc I have 1.3 cycles per string. :t
Quote
;that'll be $50, Frank
We'll see in few days if you earned them :lol:
For the time being, Dave's routine got these results on my pc:
Quote
------------------------------------------------------------------------
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz
Instructions: MMX, SSE1, SSE2, SSE3, SSSE3
------------------------------------------------------------------------
1934 cycles for Dedndave code
------------------------------------------------------------------------
1927 cycles for Dedndave code
------------------------------------------------------------------------
1917 cycles for Dedndave code
------------------------------------------------------------------------
1918 cycles for Dedndave code
------------------------------------------------------------------------
That is about 1.9 cycles per dword string. Not bad. I think we
can arrive at 0.7 cycles per dword string, but it has yet to be
demonstrated. :lol:
Quote from: frktons on December 09, 2012, 09:01:31 AMWe'll see in few days if you earned them :lol:
i accept pay-pal :biggrin:
------------------------------------------------------------------------
Intel(R) Pentium(R) 4 CPU 3.00GHz
Instructions: MMX, SSE1, SSE2, SSE3
------------------------------------------------------------------------
2556 cycles for Dedndave code
------------------------------------------------------------------------
2256 cycles for Dedndave code
------------------------------------------------------------------------
2258 cycles for Dedndave code
------------------------------------------------------------------------
2271 cycles for Dedndave code
------------------------------------------------------------------------
For Dave's code I get 2292 2299 2303 2302 2290 on a P4 Northwood and 2674 2676 2674 2677 2690 on a P3. Frank's EXE will not run on my P3 Windows 2000 system:
"Test_FillArray.exe is not a valid Win32 application."
But on my P4 Northwood Windows XP system I typically get:
------------------------------------------------------------------------
Intel(R) Pentium(R) 4 CPU 3.00GHz
Instructions: MMX, SSE1, SSE2
------------------------------------------------------------------------
2676 cycles for Dedndave code
------------------------------------------------------------------------
2655 cycles for Dedndave code
------------------------------------------------------------------------
2306 cycles for Dedndave code
------------------------------------------------------------------------
2277 cycles for Dedndave code
------------------------------------------------------------------------
But running it multiple times the first one or two results are significantly higher than the others, suggesting that the code is not delaying long enough after it loads before it starts counting cycles. After adding a:
invoke Sleep, 5000
Below the start label and assembling and linking with ML 6.15.8803 and Link 5.12.8078, typical results are:
------------------------------------------------------------------------
Intel(R) Pentium(R) 4 CPU 3.00GHz
Instructions: MMX, SSE1, SSE2
------------------------------------------------------------------------
2278 cycles for Dedndave code
------------------------------------------------------------------------
2296 cycles for Dedndave code
------------------------------------------------------------------------
2263 cycles for Dedndave code
------------------------------------------------------------------------
2285 cycles for Dedndave code
------------------------------------------------------------------------
And the EXE will then run on my P3 system:
------------------------------------------------------------------------
Pentium III
Instructions: MMX, SSE1
------------------------------------------------------------------------
2457 cycles for Dedndave code
------------------------------------------------------------------------
2437 cycles for Dedndave code
------------------------------------------------------------------------
2437 cycles for Dedndave code
------------------------------------------------------------------------
2439 cycles for Dedndave code
------------------------------------------------------------------------
That will be a close race between Dave and Frank...
:biggrin:
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
loop overhead is approx. 189/100 cycles
1556 cycles for 100 * FA Dave
541 cycles for 100 * FA Jochen
1557 cycles for 100 * FA Frank
1556 cycles for 100 * FA Dave
539 cycles for 100 * FA Jochen
1557 cycles for 100 * FA Frank
1558 cycles for 100 * FA Dave
542 cycles for 100 * FA Jochen
1564 cycles for 100 * FA Frank
230 bytes for FA Dave
281 bytes for FA Jochen
350 bytes for FA Frank
Quote from: jj2007 on December 09, 2012, 01:21:46 PM
That will be a close race between Dave and Frank...
:biggrin:
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
loop overhead is approx. 189/100 cycles
1556 cycles for 100 * FA Dave
541 cycles for 100 * FA Jochen
1557 cycles for 100 * FA Frank
1556 cycles for 100 * FA Dave
539 cycles for 100 * FA Jochen
1557 cycles for 100 * FA Frank
1558 cycles for 100 * FA Dave
542 cycles for 100 * FA Jochen
1564 cycles for 100 * FA Frank
230 bytes for FA Dave
281 bytes for FA Jochen
350 bytes for FA Frank
Actually I didn't write any code. I only posted Dave's code in
another testbed. Well with your routine you demonstrated that
the good job of Dave could be optimized with better registers
and parallel computing.
i don't think Frank has any working code, per se
looks like you got the speed thing :t
did you verify the resulting table ? :P
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
++18 of 20 tests valid, loop overhead is approx. 246/100 cycles
2069 cycles for 100 * FA Dave
915 cycles for 100 * FA Jochen
2230 cycles for 100 * FA Frank
2029 cycles for 100 * FA Dave
913 cycles for 100 * FA Jochen
2592 cycles for 100 * FA Frank
2034 cycles for 100 * FA Dave
930 cycles for 100 * FA Jochen
2217 cycles for 100 * FA Frank
Jochen,
the source you posted is for MasmBasic editor, I suppose,
couldn't you post a normal ascii text file, in order to see what it
actually does?
Frank
it's a "rich" text file, Frank
i think you can open it with WordPad, if you don't have anything else :t
this is Jochen's data and code
;-----------------------------------------------------------------------
; TestB ("FA Jochen") - SSE fill of MyArray with the ' 0'..' 999' ASCII
; table, four 4-byte strings per 16-byte movaps store.
; XMM0 carries four consecutive string images; the Add*/Sub* constant
; rows below bump the digit bytes by packed-dword adds, with the
; 0FAxxxxxxh rows apparently being composite corrections for the 9->0
; carries every 10th/100th value - not independently verified here.
; Uses masm32/MasmBasic macros (m2m, .Repeat/.Until, .if Zero?).
; Returns eax = offset MyArray; clobbers ecx, edx, esi, edi, xmm0-xmm4.
; NOTE(review): Src0123 is loaded with movaps as 16 bytes although the
;   db shows only 8 characters - the forum transcription likely
;   collapsed whitespace; TODO confirm against the original attachment.
;-----------------------------------------------------------------------
align 16
TestB_s:
Src0123 db " 0 1 2 3" ;seed: first four 4-char entries (spacing suspect, see note)
Add4444 dd 04000000h, 04000000h, 04000000h, 04000000h ; xmm1
Add44xx dd 04000000h, 04000000h, 0FA110000h, 0FA110000h ; xmm2
Addxx44 dd 0FA110000h, 0FA110000h, 04000000h, 04000000h ; xmm3
Addxxxx dd 0FA010000h, 0FA010000h, 0FA010000h, 0FA010000h ; xmm4
Add244xx dd 04000000h, 04000000h, 0FA010000h, 0FA010000h ; xmm2
Add2xx44 dd 0FA010000h, 0FA010000h, 04000000h, 04000000h ; xmm3
Add2xxxx dd 0FA010000h, 0FA010000h, 0FA010000h, 0FA010000h ; xmm4
Sub100a dd 00009EF00h, 00009EF00h, 00009EF00h, 00009EF00h
Sub100b dd 00009FF00h, 00009FF00h, 00009FF00h, 00009FF00h
NameB equ FA Jochen ; assign a descriptive name here
TestB proc
mov esi, offset Src0123 ;esi = base of seed + add-constant table
mov edi, offset MyArray ;edi = output cursor
push edi ;saved so eax can return the array address
xor ecx, ecx
movaps xmm0, [esi] ;xmm0 = current four strings
movaps xmm1, [esi+16]
movaps xmm2, [esi+32]
movaps xmm3, [esi+48]
movaps xmm4, [esi+64]
lea edx, [edi+4000] ;edx = end of the 1000-dword array
m2m ecx, -5 ;phase counter for the carry schedule
; align 4
.Repeat
movaps [edi], xmm0
paddd xmm0, xmm1 ; 4444
movaps [edi+16], xmm0
paddd xmm0, xmm2 ; 44xx
movaps [edi+32], xmm0
paddd xmm0, xmm3 ; xx44
movaps [edi+48], xmm0
paddd xmm0, xmm1 ; 4444
movaps [edi+64], xmm0
paddd xmm0, xmm4 ; xxxx
inc ecx
.if Zero?
psubd xmm0, oword ptr Sub100a ;hundred-boundary correction (first form)
.elseif ecx==-4
movaps xmm2, [esi+80] ;switch to the Add2* constant set
movaps xmm3, [esi+96]
movaps xmm4, [esi+112]
.elseif ecx==5
psubd xmm0, oword ptr Sub100b ;hundred-boundary correction (second form)
xor ecx, ecx ;restart the phase cycle
.endif
add edi, 80 ;20 strings written per pass
.Until edi>=edx
pop eax ;return offset MyArray
ret
TestB endp
TestB_endp:
Quote from: dedndave on December 09, 2012, 02:04:35 PM
it's a "rich" text file, Frank
i think you can open it with WordPad, if you don't have anything else :t
Frank does have something else: \Masm32\RichMasm\RichMasm.exe
Drag *.asc over RichMasm.exe, then click on the bookmarks "Test A" and "Test B" on the right :icon_mrgreen:
By the way: performance drops dramatically if you replace movaps with movups. But there is a well-known workaround, see attachment.
For MisAlign=0 and MisAlign=8, performance is almost like movaps (ca. 8 cycles slower, see below). For all other values, it is still almost 500 cycles faster than Dave's non-SSE code.
MisAlign=0:
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
loop overhead is approx. 189/100 cycles
1558 cycles for 100 * FA Dave
540 cycles for 100 * FA Jochen
547 cycles for 100 * FA Jochen unaligned
1557 cycles for 100 * FA Dave
539 cycles for 100 * FA Jochen
548 cycles for 100 * FA Jochen unalignedMisAlign=3:
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
loop overhead is approx. 189/100 cycles
1845 cycles for 100 * FA Dave
1365 cycles for 100 * FA Jochen unaligned
1853 cycles for 100 * FA Dave
1362 cycles for 100 * FA Jochen unaligned
I got times around 850 for my code (no sse) but half the size.
Here's your latest jj
AMD Phenom(tm) II X6 1100T Processor (SSE3)
loop overhead is approx. 231/100 cycles
856 cycles for 100 * FA Dave
449 cycles for 100 * FA Jochen
436 cycles for 100 * FA Jochen unaligned
803 cycles for 100 * FA Dave
445 cycles for 100 * FA Jochen
435 cycles for 100 * FA Jochen unaligned
809 cycles for 100 * FA Dave
454 cycles for 100 * FA Jochen
433 cycles for 100 * FA Jochen unaligned
230 bytes for FA Dave
281 bytes for FA Jochen
141 bytes for FA Jochen unaligned
Here is a source code to verify mmx is faster:
Quote
;-----------------------------------------------------------------------
; Standalone test program (ToutEnMasm): fills 'array' (1000 dwords)
; with 4-char images built in EAX (initial 30303020h = bytes
; 20h,'0','0','0'), then dumps the table to resultat.txt, 100 entries
; (400 bytes) per line.
; Digit layout in EAX: byte 3 = units (add 1000000h), byte 2 = tens
; (add 10000h, rollover checked via CL after shr 16), AH = hundreds.
; NOTE(review): introduced in the thread as "verify mmx is faster",
;   but no MMX instruction appears anywhere - GPRs only (also pointed
;   out later in the thread).
;-----------------------------------------------------------------------
.data
array dd 1000 dup (0)
controle dd 0 ;guard dword after the array (debug aid)
Nomfichier db "resultat.txt",0
Hfile dd 0
NumberOfBytesWritten dd 0
retourligne db 13,10,0 ;CR/LF written after each 400-byte line
.code
start:
mov eax,30303020h ; 000
mov edx,offset array
unit: ;unrolled inner loop: units digits 0..9
mov [edx],eax ;000
add eax,1000000h
add edx,4
mov [edx],eax ;1
add eax,1000000h
add edx,4
mov [edx],eax ;2
add eax,1000000h
add edx,4
mov [edx],eax ;3
add eax,1000000h
add edx,4
mov [edx],eax ;4
add eax,1000000h
add edx,4
mov [edx],eax ;5
add eax,1000000h
add edx,4
mov [edx],eax ;6
add eax,1000000h
add edx,4
mov [edx],eax ;7
add eax,1000000h
add edx,4
mov [edx],eax ;8
add eax,1000000h
add edx,4
mov [edx],eax ;9
;-----------------------------
sub eax,9000000h ;reset units digit back to '0'
add eax,10000h ;bump tens digit
add edx,4
mov ecx,eax
shr ecx,16 ;cl = tens character
.if cl == 3Ah ;1- ;tens overflowed past '9'?
sub eax,0A0000h ;tens back to '0'
inc ah ;303030h + 10000h ;carry into hundreds
.if ah == 3Ah ;hundreds past '9' => 1000 entries done
;fini
jmp fin
.else
jmp unit
.endif
.else
jmp unit
.endif
fin:
lea edx,controle ;debug limit must be NULL
lea edx,array ;debug view what is in memory
invoke CreateFile,addr Nomfichier,GENERIC_WRITE,NULL,\
NULL,CREATE_ALWAYS,FILE_ATTRIBUTE_NORMAL,0
mov Hfile,eax ;NOTE(review): no INVALID_HANDLE_VALUE check
mov edx,offset array
mov ecx,0
ecrire: ;write 10 lines of 100 entries each
push edx
push ecx
invoke WriteFile,Hfile,edx,400,addr NumberOfBytesWritten,NULL
invoke WriteFile,Hfile,addr retourligne,2,addr NumberOfBytesWritten,NULL
pop ecx
pop edx
add edx,400
add ecx,100
.if ecx != 1000
jmp ecrire
.endif
invoke CloseHandle,Hfile
invoke ExitProcess,0
;################################################################
end start
ascii table is put in the texte file 000 to 999
116 bytes of slackware
;-----------------------------------------------------------------------
; TestA ("FA Sinsi") - fills MyArray (1000 dwords) with the 4-byte
; ASCII images "000".."999" (hundreds,tens,units,NUL in memory order).
; EBP carries the current string image: byte 0 = hundreds char
; (inc ebp in the l2 loop), byte 1 = tens char (add ebp,100h in the
; l1 loop), byte 2 = units char (lea +n*10000h inside proc0).
; Returns eax = offset MyArray; all other registers preserved (pusha).
;
; FIX(review): the internal helper proc0 deliberately loads EBP with
; the 00050000h stride constant and, as posted, never restored it.
; After the first call the caller's "add ebp,00000100h" tens loop then
; operated on 00050000h instead of the digit pattern, corrupting every
; entry from the second call onward.  proc0 now saves/restores EBP.
;-----------------------------------------------------------------------
TestA proc
pusha
mov edi,offset MyArray ;edi = output cursor, advanced by proc0
mov ebp,00303030h ;"000" seed pattern
mov ecx,10 ;hundreds loop
l2: mov edx,10 ;tens loop
push ebp ;keep hundreds base across the tens loop
l1: call proc0 ;emit units 0..9 for current hundreds/tens
add ebp,00000100h ;next tens character
sub edx,1
jnz l1
pop ebp
inc ebp ;add ebp,00000001h ;next hundreds character
sub ecx,1
jnz l2
popa
mov eax,offset MyArray
ret
; proc0 - stores 10 entries (units '0'..'9') at [edi], advances edi by 40.
; In: ebp = base pattern, edi = cursor. Preserves ecx, edx, ebp.
proc0:
push ecx
push edx
push ebp ;FIX: preserve the digit pattern for the caller
lea eax,[ebp] ;units '0'..'4' built in five registers
lea ebx,[ebp+00010000h]
lea ecx,[ebp+00020000h]
lea edx,[ebp+00030000h]
lea esi,[ebp+00040000h]
mov ebp,00050000h ;stride to reach units '5'..'9'
mov [edi],eax
mov [edi+4],ebx
mov [edi+8],ecx
mov [edi+12],edx
mov [edi+16],esi
add eax,ebp
add ebx,ebp
add ecx,ebp
add edx,ebp
add esi,ebp
mov [edi+20],eax
mov [edi+24],ebx
mov [edi+28],ecx
mov [edi+32],edx
mov [edi+36],esi
add edi,40
pop ebp ;FIX: restore caller's pattern
pop edx
pop ecx
ret
TestA endp
This would be nice to do in 64-bit with all those regs, but the xmm's are much better.
Quote from: ToutEnMasm on December 09, 2012, 07:03:32 PM
Here is a source code to verify mmx is faster:
Where's da mmx, Yves?
Quote from: sinsi on December 09, 2012, 07:15:06 PM
116 bytes of slackware
...
This would be nice to do in 64-bit with all those regs, but the xmm's are much better.
SSE2 is difficult to beat indeed...
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
loop overhead is approx. 189/100 cycles
1557 cycles for 100 * FA Dave
540 cycles for 100 * FA Jochen
1838 cycles for 100 * FA Sinsi
547 cycles for 100 * FA Jochen unaligned
1540 cycles for 100 * FA Yves
Just goes to show, these tests are of academic interest only :biggrin:
AMD Phenom(tm) II X6 1100T Processor (SSE3)
loop overhead is approx. 228/100 cycles
786 cycles for 100 * FA Dave
441 cycles for 100 * FA Jochen
849 cycles for 100 * FA Sinsi
419 cycles for 100 * FA Jochen unaligned
1021 cycles for 100 * FA Yves
786 cycles for 100 * FA Dave
437 cycles for 100 * FA Jochen
849 cycles for 100 * FA Sinsi
431 cycles for 100 * FA Jochen unaligned
1029 cycles for 100 * FA Yves
779 cycles for 100 * FA Dave
435 cycles for 100 * FA Jochen
846 cycles for 100 * FA Sinsi
417 cycles for 100 * FA Jochen unaligned
1023 cycles for 100 * FA Yves
230 bytes for FA Dave
281 bytes for FA Jochen
116 bytes for FA Sinsi
141 bytes for FA Jochen unaligned
141 bytes for FA Yves
4208864 = eax FA Dave
4208864 = eax FA Jochen
4208864 = eax FA Sinsi
4208864 = eax FA Jochen unaligned
4208864 = eax FA Yves
it should generate leading-zero suppressed strings, right-justified in a field of 3 spaces
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
loop overhead is approx. 240/100 cycles
2040 cycles for 100 * FA Dave
919 cycles for 100 * FA Jochen
2609 cycles for 100 * FA Sinsi
1226 cycles for 100 * FA Jochen unaligned
1854 cycles for 100 * FA Yves
2040 cycles for 100 * FA Dave
943 cycles for 100 * FA Jochen
2606 cycles for 100 * FA Sinsi
1197 cycles for 100 * FA Jochen unaligned
1816 cycles for 100 * FA Yves
2037 cycles for 100 * FA Dave
920 cycles for 100 * FA Jochen
2602 cycles for 100 * FA Sinsi
1196 cycles for 100 * FA Jochen unaligned
1892 cycles for 100 * FA Yves
Quote from: sinsi on December 09, 2012, 08:05:45 PM
Just goes to show, these tests are of academic interest only :biggrin:
Absolutely :greensml:
New version:
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
loop overhead is approx. 189/100 cycles
1556 cycles for 100 * FA Dave
1811 cycles for 100 * FA Sinsi
441 cycles for 100 * FA Jochen unaligned
1541 cycles for 100 * FA YvesP.S.: If you don't agree with the suffix "FINAL", write a faster algo :bgrin:
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
loop overhead is approx. 250/100 cycles
2038 cycles for 100 * FA Dave
2590 cycles for 100 * FA Sinsi
1188 cycles for 100 * FA Jochen unaligned
1813 cycles for 100 * FA Yves
2079 cycles for 100 * FA Dave
2593 cycles for 100 * FA Sinsi
1181 cycles for 100 * FA Jochen unaligned
1795 cycles for 100 * FA Yves
2061 cycles for 100 * FA Dave
2587 cycles for 100 * FA Sinsi
1216 cycles for 100 * FA Jochen unaligned
1791 cycles for 100 * FA Yves
Quote
Intel(R) Celeron(R) CPU 2.80GHz (SSE3)
loop overhead is approx. 255/100 cycles
2057 cycles for 100 * FA Dave
2581 cycles for 100 * FA Sinsi
1210 cycles for 100 * FA Jochen unaligned
1858 cycles for 100 * FA Yves
2060 cycles for 100 * FA Dave
2550 cycles for 100 * FA Sinsi
1214 cycles for 100 * FA Jochen unaligned
1855 cycles for 100 * FA Yves
2056 cycles for 100 * FA Dave
2553 cycles for 100 * FA Sinsi
1213 cycles for 100 * FA Jochen unaligned
1857 cycles for 100 * FA Yves
230 bytes for FA Dave
116 bytes for FA Sinsi
149 bytes for FA Jochen unaligned
141 bytes for FA Yves
4208520 = eax FA Dave
4208520 = eax FA Sinsi
4208520 = eax FA Jochen unaligned
4208520 = eax FA Yves
--- ok ---
According to these tests we reached the bottom line, almost.
SSE code is always faster than any other, and Dave won $50
for being the first to post his code :t
As I foresaw:
Quote from: frktons on December 09, 2012, 10:07:32 AM
...
Not bad. I think we
can arrive at 0.7 cycles per dword string, but it has yet to be
demonstrated. :lol:
My test with Jochen's testpad:
Quote
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz (SSE4)
loop overhead is approx. 187/100 cycles
1870 cycles for 100 * FA Dave
648 cycles for 100 * FA Jochen
2442 cycles for 100 * FA Sinsi
791 cycles for 100 * FA Jochen unaligned
2009 cycles for 100 * FA Yves
1886 cycles for 100 * FA Dave
648 cycles for 100 * FA Jochen
2418 cycles for 100 * FA Sinsi
791 cycles for 100 * FA Jochen unaligned
2012 cycles for 100 * FA Yves
1871 cycles for 100 * FA Dave
648 cycles for 100 * FA Jochen
2430 cycles for 100 * FA Sinsi
790 cycles for 100 * FA Jochen unaligned
2010 cycles for 100 * FA Yves
230 bytes for FA Dave
281 bytes for FA Jochen
116 bytes for FA Sinsi
141 bytes for FA Jochen unaligned
141 bytes for FA Yves
4208864 = eax FA Dave
4208864 = eax FA Jochen
4208864 = eax FA Sinsi
4208864 = eax FA Jochen unaligned
4208864 = eax FA Yves
--- ok ---
And the prize for Dave:
I've divided my code in 3 steps, from the slowest to the fastest.
Here it is the slowest, with GPRs registers:
Quote
-------------------------------------------------------
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz
Instructions: MMX, SSE1, SSE2, SSE3, SSSE3
-------------------------------------------------------
1914 cycles for Dedndave code
1983 cycles for Frktons I Step
-------------------------------------------------------
1914 cycles for Dedndave code
1973 cycles for Frktons I Step
-------------------------------------------------------
1944 cycles for Dedndave code
1972 cycles for Frktons I Step
-------------------------------------------------------
1913 cycles for Dedndave code
1976 cycles for Frktons I Step
-------------------------------------------------------
I measured it against Dave's code that uses the same
kind of registers.
Here are my results:
------------------------------------------------------------------------
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz
Instructions: MMX, SSE1, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2
------------------------------------------------------------------------
1577 cycles for Dedndave code
2461 cycles for Frktons I Step
------------------------------------------------------------------------
1048 cycles for Dedndave code
2492 cycles for Frktons I Step
------------------------------------------------------------------------
1058 cycles for Dedndave code
2497 cycles for Frktons I Step
------------------------------------------------------------------------
1048 cycles for Dedndave code
2495 cycles for Frktons I Step
------------------------------------------------------------------------
--- ok ---
Gunther
An incredible difference between my Core Duo 2 and i7.
Adding and subtracting is very liked by the i7. :icon_eek:
Probably Jochen's solution is hard to beat for the time being. :eusa_clap:
It looks like with GPRs it is quite difficult to go below
1.900 CPU cycles, at least on my machine:
Quote
-----------------------------------------------
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz
Instructions: MMX, SSE1, SSE2, SSE3, SSSE3
-----------------------------------------------
1914 cycles for Dedndave code - 5 GPRs
2574 cycles for Frktons I Step / 2 GPRs
1889 cycles for Frktons I Step / 4 GPRs
-----------------------------------------------
1940 cycles for Dedndave code - 5 GPRs
2561 cycles for Frktons I Step / 2 GPRs
1886 cycles for Frktons I Step / 4 GPRs
-----------------------------------------------
1913 cycles for Dedndave code - 5 GPRs
2561 cycles for Frktons I Step / 2 GPRs
1900 cycles for Frktons I Step / 4 GPRs
-----------------------------------------------
1913 cycles for Dedndave code - 5 GPRs
2561 cycles for Frktons I Step / 2 GPRs
1887 cycles for Frktons I Step / 4 GPRs
-----------------------------------------------
Only using MMX registers, the performance start to rock.
Quote
------------------------------------------------------------------------
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz
Instructions: MMX, SSE1, SSE2, SSE3, SSSE3
------------------------------------------------------------------------
1926 cycles for Dedndave code - 5 GPRs
1857 cycles for Frktons I Step / 2 GPRs
1896 cycles for Frktons I Step / 4 GPRs
1320 cycles for Frktons II Step / 5 MMX
830 cycles for Jochen / 5 XMM
------------------------------------------------------------------------
1919 cycles for Dedndave code - 5 GPRs
1897 cycles for Frktons I Step / 2 GPRs
1895 cycles for Frktons I Step / 4 GPRs
1321 cycles for Frktons II Step / 5 MMX
831 cycles for Jochen / 5 XMM
------------------------------------------------------------------------
1916 cycles for Dedndave code - 5 GPRs
1898 cycles for Frktons I Step / 2 GPRs
1897 cycles for Frktons I Step / 4 GPRs
1319 cycles for Frktons II Step / 5 MMX
830 cycles for Jochen / 5 XMM
------------------------------------------------------------------------
1915 cycles for Dedndave code - 5 GPRs
1892 cycles for Frktons I Step / 2 GPRs
1894 cycles for Frktons I Step / 4 GPRs
1327 cycles for Frktons II Step / 5 MMX
830 cycles for Jochen / 5 XMM
------------------------------------------------------------------------
--- ok ---
And there are still many things to optimize.
Exception code: 0xc000001d
Fault offset: 0x00001619
STATUS_ILLEGAL_INSTRUCTION = 0xc000001d
he is probably using SSSE3 instructions in "Frktons II Step" :P
;-----------------------------------------------------------------------
; MMX/SSSE3 fill fragment ("Frktons II Step"): per pass, expands two
; seed dwords from Tens into 2 x 5 qwords (10 array entries) in MyArray.
; pshufb (mm5 = Mask2Double) duplicates the seed dword into both halves
; of mm0; mm6 (ONE_TWO) offsets the pair, mm7 (TWO_TWO) is the +2/+2
; stride chained through mm1..mm4 - assumed from the names; the
; constants Tens/Mask2Double/ONE_TWO/TWO_TWO/PtrTens are defined
; elsewhere and not visible here - TODO confirm their values.
; NOTE(review): pshufb on mm regs is SSSE3 - raises illegal-instruction
;   (0xC000001D) on pre-SSSE3 CPUs, as reported in the thread.
; NOTE(review): MMX registers alias the x87 stack; no emms appears in
;   this fragment - confirm the caller executes emms before any FPU use.
;-----------------------------------------------------------------------
lea esi, Tens ;esi = seed table cursor
lea edi, MyArray ;edi = output cursor
movq mm5, Mask2Double ;pshufb control: duplicate low dword
movq mm6, ONE_TWO ;initial pair offset
movq mm7, TWO_TWO ;per-step stride
align 4
@@:
mov eax, [esi] ;first seed dword of this pass
movd mm0, eax
pshufb mm0, mm5 ;mm0 = seed duplicated into both dwords (SSSE3)
paddd mm0, mm6
movq mm1, mm0 ;chain: each qword = previous + stride
paddd mm1, mm7
movq mm2, mm1
paddd mm2, mm7
movq mm3, mm2
paddd mm3, mm7
movq mm4, mm3
paddd mm4, mm7
movq [edi], mm0 ;store 5 qwords = 10 dword entries
movq [edi + 8], mm1
movq [edi + 16], mm2
movq [edi + 24], mm3
movq [edi + 32], mm4
mov eax, [esi + 4] ;second seed dword, same expansion
movd mm0, eax
pshufb mm0, mm5
paddd mm0, mm6
movq mm1, mm0
paddd mm1, mm7
movq mm2, mm1
paddd mm2, mm7
movq mm3, mm2
paddd mm3, mm7
movq mm4, mm3
paddd mm4, mm7
movq [edi + 40], mm0
movq [edi + 48], mm1
movq [edi + 56], mm2
movq [edi + 64], mm3
movq [edi + 72], mm4
add esi, 8 ;two seed dwords consumed
add edi, 80 ;20 entries written
cmp esi, PtrTens ;end of seed table
jl @B
Quote from: sinsi on December 10, 2012, 10:01:46 PM
Exception code: 0xc000001d
Fault offset: 0x00001619
Do you have
PSHUFB on your PC? It is SSSE3, Dave is correct.
You can use a similar instruction to duplicate low dword and store it
to high dword of mmx register, something like:
first_dw dd 0 ; put here the same value of second_dw
second_dw dd 0 ; put here the same value of first_dw
....
lea eax, first_dw
movq mm0, [eax]
N.B. Not tested
by using SSSE3 (or even SSE3), you exclude a lot of the older CPU's that are still in use
really - this is a program init function
it seems kinda silly to exclude a CPU unless SSSE3 is to be used throughout the rest of the program
Quote from: dedndave on December 10, 2012, 10:33:52 PM
by using SSSE3 (or even SSE3), you exclude a lot of the older CPU's that are still in use
really - this is a program init function
it seems kinda silly to exclude a CPU unless SSSE3 is to be used throughout the rest of the program
The owners of old CPUs should read my previous post. :lol:
you are going to tell me i need a new one ? - lol
i am very happy with the one i have
until microsoft or adobe or some other a-hole forces me to be unhappy
Quote from: dedndave on December 10, 2012, 10:38:38 PM
you are going to tell me i need a new one ? - lol
i am very happy with the one i have
until microsoft or adobe or some other a-hole forces me to be unhappy
Not really. I'm happy too with mine, a dual core P-IV, and a Core Duo.
I was talking about asm coders. They know how to change 1 single
instruction to suit their needs.
By the way, I'm only testing some instructions, nothing is going to production
as always with me. [MASM for FUN only].
you could add a little test to see if the CPU supports SSSE3
if it does not, skip that test and say "Frktons II Step requires SSSE3 support" and go on to the next test
mov eax,1
cpuid
test ch,2
jz no_ssse3_support
note: .586 or higher required to assemble "cpuid" without hard-coding
Quote from: dedndave on December 10, 2012, 10:50:09 PM
you could add a little test to see if the CPU supports SSSE3
if it does not, skip that test and say "Frktons II Step requires SSSE3 support" and go on to the next test
mov eax,1
cpuid
test ch,2
jz no_ssse3_support
note: .586 or higher required to assemble "cpuid" without hard-coding
Good idea my dear Master, I'm going to test it. Inter alia, one of my pc doesn't
have SSSE3 if I correctly remember.
OK, I updated the proggy, would you test it?
:t
------------------------------------------------------------------------
Intel(R) Pentium(R) 4 CPU 3.00GHz
Instructions: MMX, SSE1, SSE2, SSE3
------------------------------------------------------------------------
2270 cycles for Dedndave code - 5 GPRs
1940 cycles for Frktons I Step / 2 GPRs
1950 cycles for Frktons I Step / 4 GPRs
Frktons II Step requires a PC with SSSE3
1165 cycles for Jochen / 5 XMM
------------------------------------------------------------------------
2242 cycles for Dedndave code - 5 GPRs
1962 cycles for Frktons I Step / 2 GPRs
1949 cycles for Frktons I Step / 4 GPRs
Frktons II Step requires a PC with SSSE3
1176 cycles for Jochen / 5 XMM
------------------------------------------------------------------------
2248 cycles for Dedndave code - 5 GPRs
1939 cycles for Frktons I Step / 2 GPRs
1956 cycles for Frktons I Step / 4 GPRs
Frktons II Step requires a PC with SSSE3
1161 cycles for Jochen / 5 XMM
------------------------------------------------------------------------
2268 cycles for Dedndave code - 5 GPRs
1947 cycles for Frktons I Step / 2 GPRs
1987 cycles for Frktons I Step / 4 GPRs
Frktons II Step requires a PC with SSSE3
1191 cycles for Jochen / 5 XMM
------------------------------------------------------------------------
--- ok ---
here's a little routine i wrote just for you, Frank :biggrin:
CpuFeatures PROC
;---------------------------------------------------------------
;call once during program initialization
;store the value returned in EAX (AL, actually) for feature verification
;
;0 = no extended features present
;1 = MMX
;2 = SSE
;3 = SSE2
;4 = SSE3
;5 = SSSE3
;6 = SSE4.1
;
;CPUID.(EAX=1) feature bits (Intel SDM):
;  EDX bit 23 = MMX, bit 25 = SSE, bit 26 = SSE2
;  ECX bit  0 = SSE3, bit 9 = SSSE3, bit 19 = SSE4.1
;BSWAP reverses BYTES (not bits), so after "bswap edx":
;  MMX -> DH bit 7, SSE -> DL bit 1, SSE2 -> DL bit 2
;Preserves EBX (CPUID clobbers it; Win32 ABI treats it as callee-saved)
;Clobbers ECX, EDX, flags
;---------------------------------------------------------------
push ebx ;CPUID overwrites EBX - save it
mov eax,1
cpuid
bswap edx ;bring the high-byte EDX features into DL/DH
xor eax,eax
test dh,80h ;MMX (EDX bit 23 -> DH bit 7)
jz CpuF00
inc eax
test dl,2 ;SSE1 (EDX bit 25 -> DL bit 1)
jz CpuF00
inc eax
test dl,4 ;SSE2 (EDX bit 26 -> DL bit 2)
jz CpuF00
inc eax
test cl,1 ;SSE3 (ECX bit 0)
jz CpuF00
inc eax
test ch,2 ;SSSE3 (ECX bit 9, not bit 1 - cl,2 tested the wrong flag)
jz CpuF00
inc eax
test ecx,80000h ;SSE4.1 (ECX bit 19)
jz CpuF00
inc eax
CpuF00: pop ebx
ret
CpuFeatures ENDP
.DATA?
bFeatures db ?
.CODE
Start: call CpuFeatures
mov bFeatures,al
;
;
;
now, if you want to see if they have SSSE3....
cmp bFeatures,5
jb no_ssse3_support
Quote from: dedndave on December 10, 2012, 11:40:52 PM
here's a little routine i wrote just for you, Frank :biggrin:
CpuFeatures PROC
;---------------------------------------------------------------
;call once during program initialization
;store the value returned in EAX (AL, actually) for feature verification
;
;0 = no extended features present
;1 = MMX
;2 = SSE
;3 = SSE2
;4 = SSE3
;5 = SSSE3
;6 = SSE4.1
;
;CPUID.(EAX=1) feature bits (Intel SDM):
;  EDX bit 23 = MMX, bit 25 = SSE, bit 26 = SSE2
;  ECX bit  0 = SSE3, bit 9 = SSSE3, bit 19 = SSE4.1
;BSWAP reverses BYTES (not bits), so after "bswap edx":
;  MMX -> DH bit 7, SSE -> DL bit 1, SSE2 -> DL bit 2
;Preserves EBX (CPUID clobbers it; Win32 ABI treats it as callee-saved)
;Clobbers ECX, EDX, flags
;---------------------------------------------------------------
push ebx ;CPUID overwrites EBX - save it
mov eax,1
cpuid
bswap edx ;bring the high-byte EDX features into DL/DH
xor eax,eax
test dh,80h ;MMX (EDX bit 23 -> DH bit 7)
jz CpuF00
inc eax
test dl,2 ;SSE1 (EDX bit 25 -> DL bit 1)
jz CpuF00
inc eax
test dl,4 ;SSE2 (EDX bit 26 -> DL bit 2)
jz CpuF00
inc eax
test cl,1 ;SSE3 (ECX bit 0)
jz CpuF00
inc eax
test ch,2 ;SSSE3 (ECX bit 9, not bit 1 - cl,2 tested the wrong flag)
jz CpuF00
inc eax
test ecx,80000h ;SSE4.1 (ECX bit 19)
jz CpuF00
inc eax
CpuF00: pop ebx
ret
CpuFeatures ENDP
.DATA?
bFeatures db ?
.CODE
Start: call CpuFeatures
mov bFeatures,al
;
;
;
now, if you want to see if they have SSSE3....
cmp bFeatures,5
jb no_ssse3_support
Thanks Dave, and here it is something I wrote for you :biggrin::
Quote
----------------------------------------------------------------------
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz
Instructions: MMX, SSE1, SSE2, SSE3, SSSE3
----------------------------------------------------------------------
1915 cycles for Dedndave code - 5 GPRs
1837 cycles for Frktons I Step / 2 GPRs
1893 cycles for Frktons I Step / 4 GPRs
1318 cycles for Frktons II Step / 5 MMX
1365 cycles for Frktons II Step / 5 MMX without SSSE3
836 cycles for Jochen / 5 XMM
----------------------------------------------------------------------
1913 cycles for Dedndave code - 5 GPRs
1892 cycles for Frktons I Step / 2 GPRs
1893 cycles for Frktons I Step / 4 GPRs
1317 cycles for Frktons II Step / 5 MMX
1366 cycles for Frktons II Step / 5 MMX without SSSE3
836 cycles for Jochen / 5 XMM
----------------------------------------------------------------------
------------------------------------------------------------------------
Intel(R) Pentium(R) 4 CPU 3.00GHz
Instructions: MMX, SSE1, SSE2, SSE3
------------------------------------------------------------------------
2250 cycles for Dedndave code - 5 GPRs
1943 cycles for Frktons I Step / 2 GPRs
1960 cycles for Frktons I Step / 4 GPRs
Frktons II Step requires a PC with SSSE3
2148 cycles for Frktons II Step / 5 MMX without SSSE3
1161 cycles for Jochen / 5 XMM
------------------------------------------------------------------------
2242 cycles for Dedndave code - 5 GPRs
1952 cycles for Frktons I Step / 2 GPRs
1971 cycles for Frktons I Step / 4 GPRs
Frktons II Step requires a PC with SSSE3
2268 cycles for Frktons II Step / 5 MMX without SSSE3
1161 cycles for Jochen / 5 XMM
------------------------------------------------------------------------
2271 cycles for Dedndave code - 5 GPRs
1950 cycles for Frktons I Step / 2 GPRs
1965 cycles for Frktons I Step / 4 GPRs
Frktons II Step requires a PC with SSSE3
2149 cycles for Frktons II Step / 5 MMX without SSSE3
1166 cycles for Jochen / 5 XMM
------------------------------------------------------------------------
2240 cycles for Dedndave code - 5 GPRs
1941 cycles for Frktons I Step / 2 GPRs
1958 cycles for Frktons I Step / 4 GPRs
Frktons II Step requires a PC with SSSE3
2144 cycles for Frktons II Step / 5 MMX without SSSE3
1166 cycles for Jochen / 5 XMM
------------------------------------------------------------------------
Some improvements before the final release:
Quote
------------------------------------------------------------------------
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz
Instructions: MMX, SSE1, SSE2, SSE3, SSSE3
------------------------------------------------------------------------
1925 cycles for Dedndave code - 5 GPRs
1882 cycles for Frktons I Step / 2 GPRs
1941 cycles for Frktons I Step / 4 GPRs
1896 cycles for Frktons I Step / 4 GPRs - no external tab
1096 cycles for Frktons II Step / 5 MMX with SSSE3
1206 cycles for Frktons II Step / 5 MMX without SSSE3
836 cycles for Jochen / 5 XMM
------------------------------------------------------------------------
1917 cycles for Dedndave code - 5 GPRs
1882 cycles for Frktons I Step / 2 GPRs
1917 cycles for Frktons I Step / 4 GPRs
1893 cycles for Frktons I Step / 4 GPRs - no external tab
1091 cycles for Frktons II Step / 5 MMX with SSSE3
1206 cycles for Frktons II Step / 5 MMX without SSSE3
836 cycles for Jochen / 5 XMM
------------------------------------------------------------------------
--- ok ---
Almost close to the target. :t
prescott w/htt
------------------------------------------------------------------------
Intel(R) Pentium(R) 4 CPU 3.00GHz
Instructions: MMX, SSE1, SSE2, SSE3
------------------------------------------------------------------------
2379 cycles for Dedndave code - 5 GPRs
1947 cycles for Frktons I Step / 2 GPRs
1961 cycles for Frktons I Step / 4 GPRs
1986 cycles for Frktons I Step / 4 GPRs - no external tab
Frktons II Step requires a PC with SSSE3
2460 cycles for Frktons II Step / 5 MMX without SSSE3
1152 cycles for Jochen / 5 XMM
------------------------------------------------------------------------
2251 cycles for Dedndave code - 5 GPRs
1966 cycles for Frktons I Step / 2 GPRs
1961 cycles for Frktons I Step / 4 GPRs
1978 cycles for Frktons I Step / 4 GPRs - no external tab
Frktons II Step requires a PC with SSSE3
2456 cycles for Frktons II Step / 5 MMX without SSSE3
1151 cycles for Jochen / 5 XMM
------------------------------------------------------------------------
P4 Northwood:
------------------------------------------------------------------------
Intel(R) Pentium(R) 4 CPU 3.00GHz
Instructions: MMX, SSE1, SSE2
------------------------------------------------------------------------
2275 cycles for Dedndave code - 5 GPRs
1979 cycles for Frktons I Step / 2 GPRs
1996 cycles for Frktons I Step / 4 GPRs
2034 cycles for Frktons I Step / 4 GPRs - no external tab
Frktons II Step requires a PC with SSSE3
2970 cycles for Frktons II Step / 5 MMX without SSSE3
896 cycles for Jochen / 5 XMM
------------------------------------------------------------------------
2330 cycles for Dedndave code - 5 GPRs
1970 cycles for Frktons I Step / 2 GPRs
1997 cycles for Frktons I Step / 4 GPRs
2734 cycles for Frktons I Step / 4 GPRs - no external tab
Frktons II Step requires a PC with SSSE3
2937 cycles for Frktons II Step / 5 MMX without SSSE3
905 cycles for Jochen / 5 XMM
------------------------------------------------------------------------
It looks like the MMX tech is a bit slow on pre-Core Duo CPUs.
We'll see the XMM one in action next. :P
I've almost finished the study. Here it is the fastest code
so far I could imagine, but still room to optimize.
Quote
------------------------------------------------------------------------
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz
Instructions: MMX, SSE1, SSE2, SSE3, SSSE3
------------------------------------------------------------------------
1950 cycles for Dedndave code - 5 GPRs
1924 cycles for Frktons I Step / 2 GPRs
1871 cycles for Frktons I Step / 4 GPRs
1888 cycles for Frktons I Step / 4 GPRs - no external tab
1079 cycles for Frktons II Step / 5 MMX with SSSE3
1199 cycles for Frktons II Step / 5 MMX without SSSE3
801 cycles for Frktons III Step / XMM/MMX with SSE2
831 cycles for Jochen / 5 XMM
------------------------------------------------------------------------
1916 cycles for Dedndave code - 5 GPRs
1930 cycles for Frktons I Step / 2 GPRs
1872 cycles for Frktons I Step / 4 GPRs
1915 cycles for Frktons I Step / 4 GPRs - no external tab
1083 cycles for Frktons II Step / 5 MMX with SSSE3
1209 cycles for Frktons II Step / 5 MMX without SSSE3
796 cycles for Frktons III Step / XMM/MMX with SSE2
831 cycles for Jochen / 5 XMM
------------------------------------------------------------------------
--- ok ---
prescott w/htt
------------------------------------------------------------------------
Intel(R) Pentium(R) 4 CPU 3.00GHz
Instructions: MMX, SSE1, SSE2, SSE3
------------------------------------------------------------------------
2266 cycles for Dedndave code - 5 GPRs
1976 cycles for Frktons I Step / 2 GPRs
1994 cycles for Frktons I Step / 4 GPRs
2011 cycles for Frktons I Step / 4 GPRs - no external tab
Frktons II Step requires a PC with SSSE3
2459 cycles for Frktons II Step / 5 MMX without SSSE3
1141 cycles for Frktons III Step / XMM/MMX with SSE2
1183 cycles for Jochen / 5 XMM
------------------------------------------------------------------------
2257 cycles for Dedndave code - 5 GPRs
1983 cycles for Frktons I Step / 2 GPRs
1973 cycles for Frktons I Step / 4 GPRs
1996 cycles for Frktons I Step / 4 GPRs - no external tab
Frktons II Step requires a PC with SSSE3
2462 cycles for Frktons II Step / 5 MMX without SSSE3
1143 cycles for Frktons III Step / XMM/MMX with SSE2
1184 cycles for Jochen / 5 XMM
------------------------------------------------------------------------
------------------------------------------------------------
AMD Phenom(tm) II X6 1100T Processor
Instructions: MMX, SSE1, SSE2, SSE3
------------------------------------------------------------
948 cycles for Dedndave code - 5 GPRs
813 cycles for Frktons I Step / 2 GPRs
838 cycles for Frktons I Step / 4 GPRs
995 cycles for Frktons I Step / 4 GPRs - no external tab
Frktons II Step requires a PC with SSSE3
1089 cycles for Frktons II Step / 5 MMX without SSSE3
747 cycles for Frktons III Step / XMM/MMX with SSE2
666 cycles for Jochen / 5 XMM
------------------------------------------------------------
950 cycles for Dedndave code - 5 GPRs
823 cycles for Frktons I Step / 2 GPRs
845 cycles for Frktons I Step / 4 GPRs
996 cycles for Frktons I Step / 4 GPRs - no external tab
Frktons II Step requires a PC with SSSE3
1084 cycles for Frktons II Step / 5 MMX without SSSE3
749 cycles for Frktons III Step / XMM/MMX with SSE2
665 cycles for Jochen / 5 XMM
------------------------------------------------------------
--- ok ---
Frktons I Step / 2 GPRs and Frktons I Step / 4 GPRs are remarkably fast but you should have a look at the output.
Quote from: jj2007 on December 12, 2012, 06:53:03 PM
Frktons I Step / 2 GPRs and Frktons I Step / 4 GPRs are remarkably fast but you should have a look at the output.
I didn't check and it is possible that some instructions are not correct,
I was first testing for speed, and now I'm going to check for size optimization
and correctness of code. A few days more, I'm quite slow indeed.
lol
it has to work first, otherwise you are comparing apples with oranges in the timing tests :t
Quote from: dedndave on December 12, 2012, 11:25:42 PM
lol
it has to work first, otherwise you are comparing apples with oranges in the timing tests :t
Here we are, tested and posted. The performances don't change,
there were some typing errors and adding one instead of two somewhere.
These errors didn't impact on performance, but on results they did. :P
I included a PROC to display the content of the filled array, on demand. :t
prescott w/htt
------------------------------------------------------------------------
Intel(R) Pentium(R) 4 CPU 3.00GHz
Instructions: MMX, SSE1, SSE2, SSE3
------------------------------------------------------------------------
2246 cycles for Dedndave code - 5 GPRs
1964 cycles for Frktons I Step / 2 GPRs
1999 cycles for Frktons I Step / 4 GPRs
Frktons II Step requires a PC with SSSE3
2451 cycles for Frktons II Step / 5 MMX without SSSE3
1122 cycles for Frktons III Step / XMM/MMX with SSE2
1167 cycles for Jochen / 5 XMM
------------------------------------------------------------------------
2240 cycles for Dedndave code - 5 GPRs
1939 cycles for Frktons I Step / 2 GPRs
2015 cycles for Frktons I Step / 4 GPRs
Frktons II Step requires a PC with SSSE3
2588 cycles for Frktons II Step / 5 MMX without SSSE3
1115 cycles for Frktons III Step / XMM/MMX with SSE2
1167 cycles for Jochen / 5 XMM
------------------------------------------------------------------------
Two out of three goals are accomplished, now the last, but not least,
optimization: code size and some small improvement, if needed. ;)
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz
Instructions: MMX, SSE1, SSE2, SSE3
--------------------------------------------------------
1740 cycles for Dedndave code - 5 GPRs
1348 cycles for Frktons I Step / 2 GPRs
1262 cycles for Frktons I Step / 4 GPRs
Frktons II Step requires a PC with SSSE3
943 cycles for Frktons II Step / 5 MMX without SSSE3
941 cycles for Frktons III Step / XMM/MMX with SSE2
728 cycles for Jochen / 5 XMM
--------------------------------------------------------
1742 cycles for Dedndave code - 5 GPRs
1350 cycles for Frktons I Step / 2 GPRs
1262 cycles for Frktons I Step / 4 GPRs
Frktons II Step requires a PC with SSSE3
944 cycles for Frktons II Step / 5 MMX without SSSE3
937 cycles for Frktons III Step / XMM/MMX with SSE2
728 cycles for Jochen / 5 XMM
On my home PC the III step is a bit faster than Jochen's code,
and in my office PC it is even faster.
Quote
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz
-------------------------------------------------------------
813 cycles for Frktons III Step / XMM/MMX with SSE2
835 cycles for Jochen / 5 XMM
Quote
Intel(R) Pentium(R) CPU G6950 @ 2.80GHz
-------------------------------------------------------------
...
405 cycles for Frktons III Step / XMM/MMX with SSE2
644 cycles for Jochen / 5 XMM
I'm actually studying a faster/smaller solution because, as Jochen said:
Quote from: jj2007 on December 09, 2012, 08:25:09 PM
P.S.: If you don't agree with the suffix "FINAL", write a faster algo :bgrin:
I don't agree with the suffix, I'm not at the FINAL stage so far. :lol:
Quote from: frktons on December 14, 2012, 09:58:46 AM
On my home PC the III step is a bit faster than Jochen's code,
and in my office PC it is even faster.
...
I'm actually studying a faster/smaller solution because, as Jochen said:
Quote from: jj2007 on December 09, 2012, 08:25:09 PM
P.S.: If you don't agree with the suffix "FINAL", write a faster algo :bgrin:
I don't agree with the suffix, I'm not at the FINAL stage so far. :lol:
So the incentive worked :greensml: :t
This could be my final test, if you don't have any suggestion
to enhance the performance of the last code:
Quote
------------------------------------------------------------------------
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz
Instructions: MMX, SSE1, SSE2, SSE3, SSSE3
------------------------------------------------------------------------
1915 cycles for Dedndave code - 5 GPRs
1890 cycles for Frktons I Step / 2 GPRs
1964 cycles for Frktons I Step / 4 GPRs with LEA
1114 cycles for Frktons II Step / 5 MMX with SSSE3
1199 cycles for Frktons II Step / 5 MMX without SSSE3
811 cycles for Frktons III Step / XMM/MMX with SSE2
630 cycles for Frktons III Step / XMM with SSE2 - enhanced
706 cycles for Jochen / 5 XMM
------------------------------------------------------------------------
1915 cycles for Dedndave code - 5 GPRs
1896 cycles for Frktons I Step / 2 GPRs
1978 cycles for Frktons I Step / 4 GPRs with LEA
1110 cycles for Frktons II Step / 5 MMX with SSSE3
1199 cycles for Frktons II Step / 5 MMX without SSSE3
813 cycles for Frktons III Step / XMM/MMX with SSE2
628 cycles for Frktons III Step / XMM with SSE2 - enhanced
704 cycles for Jochen / 5 XMM
------------------------------------------------------------------------
--- ok ---
prescott w/htt
Quote------------------------------------------------------------------------
Intel(R) Pentium(R) 4 CPU 3.00GHz
Instructions: MMX, SSE1, SSE2, SSE3
------------------------------------------------------------------------
2267 cycles for Dedndave code - 5 GPRs
1957 cycles for Frktons I Step / 2 GPRs
2035 cycles for Frktons I Step / 4 GPRs with LEA
Frktons II Step requires a PC with SSSE3
2459 cycles for Frktons II Step / 5 MMX without SSSE3
1126 cycles for Frktons III Step / XMM/MMX with SSE2
1197 cycles for Frktons III Step / XMM with SSE2 - enhanced
1159 cycles for Jochen / 5 XMM
------------------------------------------------------------------------
2282 cycles for Dedndave code - 5 GPRs
1967 cycles for Frktons I Step / 2 GPRs
2031 cycles for Frktons I Step / 4 GPRs with LEA
Frktons II Step requires a PC with SSSE3
2483 cycles for Frktons II Step / 5 MMX without SSSE3
1126 cycles for Frktons III Step / XMM/MMX with SSE2
1185 cycles for Frktons III Step / XMM with SSE2 - enhanced
1158 cycles for Jochen / 5 XMM
------------------------------------------------------------------------
Quote------------------------------------------------------------------------
Intel(R) Core(TM) i3 CPU M 370 @ 2.40GHz
Instructions: MMX, SSE1, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2
------------------------------------------------------------------------
1335 cycles for Dedndave code - 5 GPRs
1855 cycles for Frktons I Step / 2 GPRs
1009 cycles for Frktons I Step / 4 GPRs with LEA
587 cycles for Frktons II Step / 5 MMX with SSSE3
643 cycles for Frktons II Step / 5 MMX without SSSE3
389 cycles for Frktons III Step / XMM/MMX with SSE2
344 cycles for Frktons III Step / XMM with SSE2 - enhanced
460 cycles for Jochen / 5 XMM
------------------------------------------------------------------------
1278 cycles for Dedndave code - 5 GPRs
1025 cycles for Frktons I Step / 2 GPRs
1015 cycles for Frktons I Step / 4 GPRs with LEA
585 cycles for Frktons II Step / 5 MMX with SSSE3
633 cycles for Frktons II Step / 5 MMX without SSSE3
400 cycles for Frktons III Step / XMM/MMX with SSE2
345 cycles for Frktons III Step / XMM with SSE2 - enhanced
437 cycles for Jochen / 5 XMM
------------------------------------------------------------------------
--- ok ---
As usual, different CPUs = different performances :P
Considering I'm targeting Core Duo upwards, I can be satisfied
that I reached less than 0.7 cycles for each dword string. :biggrin:
ahhh nice algo
(http://3.bp.blogspot.com/-KBh4qCwokmo/UFIpFNQicqI/AAAAAAAAODM/ye726PVWUfE/s1600/laptop-thief.jpg)
Quote from: Farabi on December 23, 2012, 09:24:58 AM
ahhh nice algo
Thanks my friend. The application of unrolling and SSE2 code
produce this fast algo. :t
Hi, sorry for bumping in - and being late.
I stumbled over that wonderful thread because I was unhappy with the timing of a simple prime number generator. No math 8) or pun intended, the slow part can be simplified for the show as:
* to mark every 10th bit starting with bits 7 and 10 in a GB of memory. (7,10,17,20,27,30,...)
* also to mark every 14th bit starting with bits 15 and 24 (15,24, 29, 38, ...)
That classical C code ( bytelist[j/64] |= 1UL<<(j%64) ) seems to take quite some time; I get around 0.04 s on an Intel(R) Core(TM)2 Duo CPU T5750 @ 2.00GHz. Maybe that is optimized, and I am hunting a pink unicorn.
I do not know asm yet, I do not yet use masm. All I have is that feeling that for setting every 10th bit should be a smarter way than those presented by gcc or clang.
It's just for fun. I even tried parallelisation on the Cuda side of life (in C), but I seem to make all of those terrible mistakes of the unblessed beginners - it gets much slower than the same partitioned code on plain CPU. One pitfall is parallel writes to the same word on different bits from different threads - it overwrites other set bits (setting the 20th bit needs loading the word, setting the bit, saving the word - so the recently stored bit 15 gets cleared again)
I think SSE3 or SSSE3 should make it easy, and even IntelHD on the other computer (Core i5) should have enough cpu hardware units to set a few bits in parallel.
i would try to get away from multiple threads
there may be a faster way in SSE - i am not familiar with SSE shifts and rotates
but, you might think of something along this line
mov eax,ecx ;every 10th bit from ECX
mov edx,ebp ;every 10th bit from EBP
or eax,esi ;every 14th bit from ESI
or edx,ebx ;every 14th bit from EBX
mov [edi],eax
mov [edi+4],edx
;do rotates here
add edi,8
;loop to top
If you can determine the period of the patterns (related to 128 bit blocks), it should be really simple to OR combine the pre-calculated patterns with the target bytes (e.g. using the SSE2 instr. POR).
oh yah - a look-up table could have the patterned data
Quote from: flaschenpost on June 22, 2013, 10:58:12 AM
It's just for fun. I even tried parallelisation on the Cuda side of life (in C), but I seem to make all of those terrible mistakes of the unblessed beginners - it gets much slower than the same partitioned code on plain CPU. One pitfall is parallel writes to the same word on different bits from different threads - it overwrites other set bits (setting the 20th bit needs loading the word, setting the bit, saving the word - so the recently stored bit 15 gets cleared again)
I think SSE3 or SSSE3 should make it easy, and even IntelHD on the other computer (Core i5) should have enough cpu hardware units to set a few bits in parallel.
Probably that's because CUDA (and SSE) are designed for really simple number-crunching commands, not for stuff like bit manipulation. I.e., they work well with float/probably integer datatypes, which can be processed well in parallel - image processing etc. - but will probably be slower than the CPU in things like bit manipulation. Maybe the solution for the speed-up on the CPU is to parallelise bits-in-bytes manipulation - i.e., prepare the bytes of an array independently, step by step, and these steps should be intermixed, like "work with bit 1 of byte 1, work with bit 1 of byte 2, work with bit 1 of byte 3" then "work with bit 2 of byte 1, work with bit 2 of byte 2, work with bit 2 of byte 3" and so on, then put the combined bytes into the array.
(That was just an assumption about CUDA; I have not really worked with it at all.)
Quote from: flaschenpost on June 22, 2013, 10:58:12 AM
One pitfall is parallel writes to the same word on different bits from different threads - it overwrites other set bits (setting the 20th bit needs loading the word, setting the bit, saving the word - so the recently stored bit 15 gets cleared again)
Hi,
Well in assembly you need not load, modify, and store. You can
modify the memory directly. And that would avoid resetting any
set bits.
MOV EAX,[Memory] ; Load
OR EAX,BitPattern ; Modify
MOV [Memory],EAX ; Store
; can become
OR DWORD PTR [Memory],BitPattern
Regards,
Steve N.
Quote from: flaschenpost on June 22, 2013, 10:58:12 AM
* to mark every 10th bit starting with bits 7 and 10 in a GB of memory. (7,10,17,20,27,30,...)
* also to mark every 14th bit starting with bits 15 and 24 (15,24, 29, 38, ...
As qWord already mentioned, find the right periods and go ahead. For "every 10th bit", 80 bytes is the period - 5*16, or 2*32+1*16. Every 14th bit is trickier, of course - 8*14=?
i'm not sure i understand the patterns - lol
7, 17, 29 are primes
10, 15, 20, 24, 27, 30, 38 are not
Thanks for all the suggestions; I also thought about using the lookup tables. They sound great for 2*5 and 2*7, but I am suspicious about the bytewise operation (is there some 8-bit logic still around?), so it might get to 2*5*64 bit and 2*7*64 bit for the first two numbers, and then up it goes (2*11*64, 2*13*64). At least in my C proof of concept they are much slower than the (gcc-optimized) naive solutions.
@dedndave: well, the trivial non-primes I just won't store, so I don't care about even numbers or multiples of 3. The list simply gets
5 7 11 13 17 19 23 25 29 31 35 37 ...
0 1 2 3 4 5 6 7 8 9 10 11 ...
The marking of all numbers divisible by 5 is marking every 10th bit starting from 7 (which is 25) and from 10 (which is 35). You see, 25 + 5 makes 30 and is even, so no need to check that. 35 is marked with bit 10, 55 with bit 17.
But maybe I am really near the possible optimum. Coming from all the high-level-languages I "thought" there would be some magic working with bit-Stream-like sources and combining them in big chunks with "OR".
I definitely will play with the SSE-Registers.
Not sure if SSE will bring the big improvement, but...
You have two overlaid loops to handle, step 10 and step 14 bits. For speed, you would need to use (many) registers to set the bits and minimise mem-to-mem operations. It may look roughly like this in the end:
Entry:
xor eax, eax ; same as mov eax, 0
xor ebx, ebx
xor ecx, ecx
...
or reg32, whatever ; set bits for step 10
...
or reg32, whatever ; set bits for step 14
...
mov [edi], eax ; write regs to mem
mov [edi+4], ebx
mov [edi+8], ecx
mov [edi+12], edx
Now it would be great if you could post the fast gcc solution, with an __asm int 3; on entry to the innermost loop, so that we could check at assembly level what makes it fast. As a rule of thumb, hand-crafted assembly can be at least a factor 2 faster.
@jj2007: here the code, I tried gcc with option -masm=intel.
Unfortunately __asm int 3; does not work with my linux-gcc. I inserted it manually at line 82, function doWork, before loop .L11 - but forgot after recompile ::)
SSE may not help much, because memory accesses are limited by hardware
but - it may help a little, because you can write 128 bits at a whack
the idea would be to make a pattern table for one set and a pattern table for the other set
then, use SSE code to OR the appropriate pattern values together, then store them
even so.....
i think you said your current code takes 40 mS to write a GB
that's not too bad, really :P
Quote from: flaschenpost on June 23, 2013, 10:13:28 PM
@jj2007: here the code, I tried gcc with option -masm=intel.
I don't have gcc. Can you post the executable, please?
There is a tn.s inside the zip. I hoped that would be compatible, but jwasm does not accept it.
Does anybody know how to compile C - Code into assembly for masm? The relevant part looks like
# doWork — gcc output, AT&T syntax, x86-64 SysV ABI.
# Inner loop of the sieve bit-marker: C equivalent is
#   bytelist[j/64] |= 1UL << (j%64)
# for two interleaved bit indices advancing by a common step.
# Register roles (inferred from the instructions below — confirm vs. C source):
#   rdi = qword array base, rsi/rdx = two current bit indices,
#   r8 = step (doubled on entry), rcx>>1 -> rax = upper bound.
doWork:
.LFB64:
.cfi_startproc
movq %rcx, %rax # rax = rcx
shrq %rax # rax >>= 1  (limit for the bit indices)
cmpq %rax, %rdx
ja .L10 # rdx already past the limit -> skip main loop
addq %r8, %r8 # step *= 2
movl $1, %r10d # r10 = 1, the bit that gets shifted into place
.p2align 4,,10
.p2align 3
.L11: # each pass sets bit rsi and bit rdx, then advances both
movq %rsi, %r9
movl %esi, %ecx
movq %r10, %r11
shrq $6, %r9 # r9 = rsi / 64  (qword index)
salq %cl, %r11 # r11 = 1 << (rsi % 64): shift uses only cl's low 6 bits
movl %edx, %ecx
orq %r11, (%rdi,%r9,8) # array[rsi/64] |= mask
movq %rdx, %r9
movq %r10, %r11
shrq $6, %r9 # r9 = rdx / 64
addq %r8, %rdx # rdx += step (old rdx's shift count already in cl)
salq %cl, %r11 # r11 = 1 << (old rdx % 64)
addq %r8, %rsi # rsi += step
orq %r11, (%rdi,%r9,8) # array[old rdx/64] |= mask
cmpq %rax, %rdx
jbe .L11 # loop while rdx <= limit (unsigned)
.L10:
cmpq %rax, %rsi # tail: one leftover bit at rsi?
jae .L9
movq %rsi, %rdx
movl $1, %eax
movl %esi, %ecx
shrq $6, %rdx
salq %cl, %rax
orq %rax, (%rdi,%rdx,8) # array[rsi/64] |= 1 << (rsi % 64)
.L9:
rep # "rep ret": gcc's branch-predictor workaround for AMD CPUs
ret
.cfi_endproc
I try to find a free compiler for win and come back then.
Using a gcc version 4.7.2 from a recent MinGW installation, Intel-syntax assembly is no problem, for inline assembly or for the assembly output.
#include <stdio.h>
#include <conio.h>   /* non-standard (MinGW/MSVC) header providing getch() */

/* Demo program used to show gcc's Intel-syntax assembly output.
   Returns the area of a circle of radius r (pi truncated to 3.14159). */
double CalcArea(const double r)
{
return r*r*3.14159;
}
int main (void)
{
printf("%f\n", CalcArea(1));   /* the int literal 1 converts to double */
getch();                       /* wait for a keypress before exiting */
__asm__("int 3\n");            /* breakpoint trap: marks this spot for a debugger */
}
set file="test"
gcc -pedantic -S -masm=intel %file%.c -o %file%.asm
pause
gcc -pedantic -masm=intel %file%.c -o %file%.exe
pause
.file "test.c"
.intel_syntax noprefix
.text
.globl _CalcArea
.def _CalcArea; .scl 2; .type 32; .endef
_CalcArea:
LFB6:
.cfi_startproc
push ebp
.cfi_def_cfa_offset 8
.cfi_offset 5, -8
mov ebp, esp
.cfi_def_cfa_register 5
sub esp, 8
mov eax, DWORD PTR [ebp+8]
mov DWORD PTR [ebp-8], eax
mov eax, DWORD PTR [ebp+12]
mov DWORD PTR [ebp-4], eax
fld QWORD PTR [ebp-8]
fmul QWORD PTR [ebp-8]
fld QWORD PTR LC0
fmulp st(1), st
leave
.cfi_restore 5
.cfi_def_cfa 4, 4
ret
.cfi_endproc
LFE6:
.def ___main; .scl 2; .type 32; .endef
.section .rdata,"dr"
LC3:
.ascii "%f\12\0"
.text
.globl _main
.def _main; .scl 2; .type 32; .endef
_main:
LFB7:
.cfi_startproc
push ebp
.cfi_def_cfa_offset 8
.cfi_offset 5, -8
mov ebp, esp
.cfi_def_cfa_register 5
and esp, -16
sub esp, 16
call ___main
mov eax, 0
mov edx, 1072693248
mov DWORD PTR [esp], eax
mov DWORD PTR [esp+4], edx
call _CalcArea
fstp QWORD PTR [esp+4]
mov DWORD PTR [esp], OFFSET FLAT:LC3
call _printf
call _getch
/APP
# 13 "test.c" 1
int 3
# 0 "" 2
/NO_APP
leave
.cfi_restore 5
.cfi_def_cfa 4, 4
ret
.cfi_endproc
LFE7:
.section .rdata,"dr"
.align 8
LC0:
.long -266631570
.long 1074340345
.def _printf; .scl 2; .type 32; .endef
.def _getch; .scl 2; .type 32; .endef
Edit:
Somehow I missed your attachment. I added code to time the doWork procedure (Windows code that calls GetTickCount, sorry no time to do better) . The attachment contains the assembly output and EXEs for no optimization, -O1, -O2, -O3, and -Ofast. On both of my systems -O1 was the fastest.
I have to admit that I am not able to produce masm / jwasm - compatible assembly from my C-Code, I did not know how many different text formats for the same simple opcodes one could invent :( . I thought "Intel-format" would do the job. I use Intel C Compiler icc to optimize and I get factor 2 with converting to a lookup-table _and_ converting from a while-loop to a for-loop, which surprised me.
Since I can not find an assembler for the output of Intel-C-Compiler icc I can not produce dos-executables.
I begin to understand why a forum like this needs to be specific about OS and Tools.
I attached the current state of c and asm files. tn is "naive", tl is "looped".
The main problem stays:
1. I can either produce simple loops, good for optimization and vectorization, but then I have to go through the memory more often
or
2. I can try to be memory-local (Cachelines etc), but then the loops get complicated.
From my understanding of processors it should be fastest to:
- fetch a cache line from memory as big as possible in a read
- write into that memory segment with many of those bit-settings (5 and 7 resp. every 10th and every 14th bit were just the examples of course, the steps get bigger up to 1e16)
- flush that memory segment back to main mem, i.e. tell the processor that line would not be needed again in near future.
Or, alternatively:
- fill the SSE-Registers with the pattern
- write directly from SSE with "or" into main mem if that is possible.
So some questions: Is it possible to rotate the 256 bits of an AVX register or the 128 bits of an SSSE3 register? Is it even possible to fetch all the SSE/AVX registers from one memory region in one step, and to write them back together? Can I bit-shift or bit-rotate through those registers in a small number of cycles?
If I profile my code with valgrind, then the main time is needed for memory access (surprise, surprise!). But if I try to reduce the loops over memory by processing packets of memory together, then I get slower than the naive solution.
The question of the solution with a small number of cycles seems to be almost as complex as the nowadays processors themselves.
@JJ2007: I was slightly misunderstood - I did not want you to compile with gcc, but I tried to make my gcc produce a "good" assembly file (.s in the flasch0.zip). What toolstack do you use when you have a piece of C code to optimize? Or do you really write the whole software directly in assembly and with your own libs?
I have even blown the dust off my old Windows- Partition (XP) and tried a VC++ Express, but got not too far into it (some trouble with a precompiled header that does not work and that I can not really get rid of).
The funny thing is: version tn.c is slower than tl.c in that pure form, but when I try to pack it (even only for p=5 and p=7) into the bigger loop, it makes the whole process slower instead of faster.
I am still frustrated that involving the 16 "machines" of my Nvidia Cuda did not speedup the marking of a rhythm of bits. Probably I have some big mistakes with those memory levels in Cuda.
I think using 4 threads for 4 parts of memory should use the 2 cores á 2 hyperthreads and speedup things, but maybe memory read/write is the bottleneck and threading makes it all worse.
Many thanks for the suggestions and discussions so far, and I hope that the question will bring some nice thoughts from the playful experts here!
If any questions arise, I will try to answer them...
is this 128bit sequence correct? every 10th bit starting at bit 7?
0'0000001'0000000001'0000000001'0000000001'00 00000001'0000000001'0000000001'0000000001'0000000001'0000000001'0000000001'0000000001'0000000001
byte0=1
byte1=0
byte2=64
byte3=16
byte4=4
byte5=1
Quote from: flaschenpost on June 25, 2013, 06:20:29 AM
@JJ2007: I was slightly misunderstood - I did not want you to compile with gcc, but I tried to make my gcc produce a "good" assembly file (.s in the flasch0.zip). What toolstack do you use when you have a piece of C code to optimize? Or do you really write the whole software directly in assembly and with your own libs?
My mistake - I should have looked more carefully at the .s file...
This seems to be the innermost loop:
.L11:
mov r9, rsi
mov ecx, esi
mov r11, r10
shr r9, 6
sal r11, cl
mov ecx, edx
or QWORD PTR [rdi+r9*8], r11
mov r9, rdx
mov r11, r10
shr r9, 6
add rdx, r8
sal r11, cl
add rsi, r8
or QWORD PTR [rdi+r9*8], r11
cmp rdx, rax
jbe .L11That looks slow, honestly. I am not so proficient in 64-bit assembly, but in general we try to avoid shifts because they are slow, apart from other design problems in that loop.
Re toolstack: Masm32 is a library of very fast routines, compared to standard CRT; MasmBasic is another library with often much faster algos, and for specific tasks you will find many more in the "Laboratory" section. And yes we do write every algo from scratch, and we all find it particularly amusing to beat a CRT algo by a factor 3 and higher :icon_mrgreen:
Quote from: jj2007 on June 25, 2013, 06:59:23 AM
That looks slow, honestly. I am not so proficient in 64-bit assembly, but in general we try to avoid shifts because they are slow, apart from other design problems in that loop.
Well, for setting a certain bit in a byte I can't see an alternative. Following http://gmplib.org/~tege/x86-timing.pdf seemed not too bad for shifts, but they do not measure quad word operations.
Do you see a better translation of:
while(k <= limit/2){
C[j / 64] |= (1L << (j % 64));
C[k / 64] |= (1L << (k % 64));
j +=2*p;
k+=2*p;
}
if(j < limit/2){
C[j / SW] |= ((ST)1 << (j % SW));
}
Since the values in the lookup table are only right shifted values of a starting entry I even thought about replacing the lookup table (higher value for p gets big tables) with shift operations. But you sound like a lookup in a local long[120] is faster than shifting the same value.
Quote from: jj2007 on June 25, 2013, 06:59:23 AM
Re toolstack: Masm32 is a library of very fast routines, compared to standard CRT; MasmBasic is another library with often much faster algos, and for specific tasks you will find many more in the "Laboratory" section. And yes we do write every algo from scratch, and we all find it particularly amusing to beat a CRT algo by a factor 3 and higher :icon_mrgreen:
Wow, sounds like a nice game. :)
I think for an optimal prime number sieve there would be needed techniques like in Blas/Atlas: they measure some characteristics of the processor and system and then create an optimizer profile for just that system, so maximizing cache usage, parallelization etc. What impressed me was that in the early 2000s there were really expensive BLAS libs out there from vendors like Sun and IBM, hand-optimized for their systems, but ATLAS made it as a general package right into the same range of speed.
But nobody really needs a fast prime sieve, so it is not worth that amount of work. ;-)
The strange thing about your "shift is slow" is that replacing a lot of shifts just with the predefined pattern makes "only" factor 2.
The (short) preparation in the fastest compiled version right now is
# Begin ASM
# prepare patterns
# End ASM
# LOE rbx rbp r12 r13 r14 r15
..B2.16: # Preds ..B2.2
movq %r14, %rax #39.21
shlq $6, %rax #39.21
addq $64, %rax #39.21
cmpq %rax, %rbp #39.21
jae ..B2.6 # Prob 10% #39.21
# LOE rax rbx rbp r12 r13 r14 r15
..B2.4: # Preds ..B2.16 ..B2.4
movq %r12, %rsi #40.15
movq %rbp, %r8 #43.15
shrq $6, %rsi #40.15
movl %r12d, %ecx #40.37
shrq $6, %r8 #43.15
movl $1, %edx #40.37
shlq %cl, %rdx #40.37
movl %ebp, %ecx #43.37
movl $1, %edi #43.37
lea (%rbp,%r14,2), %rbp #45.9
orq %rdx, (%r15,%rsi,8) #40.9
lea (%r12,%r14,2), %r12 #42.9
orq %rdx, (%rsp,%rsi,8) #41.9
shlq %cl, %rdi #43.37
orq %rdi, (%r15,%r8,8) #43.9
orq %rdi, (%rsp,%r8,8) #44.9
cmpq %rax, %rbp #39.21
jb ..B2.4 # Prob 95% #39.21
# LOE rax rbx rbp r12 r13 r14 r15
..B2.6: # Preds ..B2.4 ..B2.16
cmpq %rax, %r12 #47.18
jae ..B2.8 # Prob 50% #47.18
# LOE rbx r12 r13 r14 r15
..B2.7: # Preds ..B2.6
movq %r12, %rdx #49.15
movl %r12d, %ecx #49.37
shrq $6, %rdx #49.15
movl $1, %eax #49.37
shlq %cl, %rax #49.37
orq %rax, (%r15,%rdx,8) #49.9
orq %rax, (%rsp,%rdx,8) #50.9
# LOE rbx r13 r14 r15
..B2.8: # Preds ..B2.6 ..B2.7
movq (%rsp,%r14,8), %rax #53.17
movq %rax, (%rsp) #53.5
shrq $7, %r13 #55.22
# LOE rbx r13 r14 r15
..B2.18: # Preds ..B2.8
and the real loop is simply a looped memcopy:
# Begin ASM
# main loop
# End ASM
# LOE rbx r13 r14 r15
..B2.17: # Preds ..B2.18
movq %r14, %rcx #57.9
cmpq %r13, %r14 #57.22
jae ..B2.12 # Prob 10% #57.22
# LOE rcx rbx r13 r14 r15
..B2.10: # Preds ..B2.17 ..B2.10
movq %rcx, %rax #59.30
xorl %edx, %edx #59.30
divq %r14 #59.30
movq (%rsp,%rdx,8), %rbp #59.19
orq %rbp, (%r15,%rcx,8) #59.9
addq $2, %rcx #60.9
cmpq %r13, %rcx #57.22
jb ..B2.10 # Prob 95% #57.22
# LOE rcx rbx r13 r14 r15
..B2.12: # Preds ..B2.10 ..B2.17
# Begin ASM
# end of main loop
# End ASM
For the higher primes (>64) this makes no sense, because there would be write accesses not on every DWORD in the naive solution.
Well, my recent version comes down to a memcpy-like operation to logical combine a memory range out in RAM with a small memory from stack or something like that. So instead of
memcpy(C + offset, local, count)
I would need a
mem_or(C + offset, local, count)
to replace the long for-loop.
Is here something like that? Did you even speedup memcpy or memfill?
AVX is out of the race, it can't bit-shift. Otherwise it would have been _really_ great.
Even AVX2 seems not to be able to bit-rotate the whole 256 bits - what a pity for me... If that had been built in, I could store the unshifted bit patterns for 15 primes in 256-bit-length YMM0 .. YMM14, rotate each register to the right place and "OR" it with and into YMM15. That would mean 256 bits in a few clock cycles for the first 15 primes, plus the shift and the modulo logic around for 15 tinyints.
But extracting the register, shifting dword by dword with parity, restoring it and _then_ use the 256-bit-OR seems really silly.
SSE should be the way to go, i would think
one problem is the way you start out
Quote* to mark every 10th bit starting with bits 7 and 10 in a GB of memory. (7,10,17,20,27,30,...)
* also to mark every 14th bit starting with bits 15 and 24 (15,24, 29, 38, ...)
let's make it fast and say
Quote* to mark every 10th bit starting with bits 7 and 0 in a GB of memory. (7,0,17,10,27,20,...)
* also to mark every 14th bit starting with bits 1 and 10 (1,10,15,24, 29, 38, ...)
* at the end, go back and correct the first few bytes :)
this is the OWORD bit pattern for the first line of description, shown in little-endian form
10000001 00100000 01001000 00010010 00000100 10000001 00100000 01001000 00010010 00000100 10000001 00100000 01001000 00010010 00000100 10000001
00100000 01001000 00010010 00000100 10000001 00100000 01001000 00010010 00000100 10000001 00100000 01001000 00010010 00000100 10000001 00100000
01001000 00010010 00000100 10000001 00100000 01001000 00010010 00000100 10000001 00100000 01001000 00010010 00000100 10000001 00100000 01001000
00010010 00000100 10000001 00100000 01001000 00010010 00000100 10000001 00100000 01001000 00010010 00000100 10000001 00100000 01001000 00010010
00000100 10000001 00100000 01001000 00010010 00000100 10000001 00100000 01001000 00010010 00000100 10000001 00100000 01001000 00010010 00000100
the pattern repeats after the 5th OWORD
the pattern for the second line of description repeats after the 7th OWORD
make 2 tables, one with 5 OWORD's, one with 7 OWORD's
POR them together and write
adjust the 3 pointers (one for table 1, one for table 2, one for the destination)
rinse and repeat
when you're done, go back and straighten out the first few bytes
it would seem to me that the entire array is, then, a repeat of the first 35 OWORD's :biggrin:
so - it could be even faster
just create the first 280 bytes, and copy it until you are happy - lol
a little faster, still....
create the first 280 bytes
copy it several times to make a larger block
copy the larger block several times until the gig is full
@dedndave: yeah, that was what I tried in C but it made my code not faster, but only more complicated (maybe because I have no processor and asm experience). :( Thanks anyway for strengthening my mood in that direction!
The critical part seems to be the "OR" of two memory blocks, since 5 and 7 were only the starting point, and when I stir in the 11, 13 and 17 into the soup the blocks get really large (and since they are prime numbers, alignment can be achieved only by another 64 in the list), so it has to get a fast solution to "OR" a local memory block with the gig out there.
I read a quite strange thread about 64-bit and xmemcopy, so I will try to adopt ideas from there to OR blocks. I have made a big improvement (in C) by unrolling the loop and doing 128 long-OR in a row.
So still needed:
- optimized code to memfill blocks of data (cache? stack?)
- optimized code to set several bits through a big memory region (cache? registers? stack?
I try to learn enough assembler to write any working code!
You basically need one read-only memory block that or's the two 10 and 14 bit sequences (I have started the exercise below). You then combine them into a single block with 7*5*8 =280 bytes.
272 of them can be accessed in 16-byte steps, then you need an 8-byte write. Or you choose a 560-byte read area to access all of them in 16-byte steps.
Speed depends then on how much of that read memory you can squeeze into registers - xmm or ymm. With 8 xmm regs, you can cover 7*16=112 bytes, the rest must be read from memory with remaining xmm. With ymm regs, coverage gets better but still you can't arrive at 560 bytes.
After that, it's simply or'ing the Gigabyte with a battery of registers. There is a MOVNTDQ (Move Double Quadword Non-Temporal) instruction that is worth studying, but afaik no similar instruction is available for or'ing memory.
HTH
.data
;x 10987654321098765432109876543210___32109876543210987654321098765432___54321098765432109876543210987654
Bits10 dd 01001000000100100000010010000000b, 00010010000001001000000100100000b, 00000100100000010010000001001000b
;x 76543210987654321098765432109876___98765432109876543210987654321098
dd 10000001001000000100100000010010b, 00100000010010000001001000000100b
;x 10987654321098765432109876543210___32109876543210987654321098765432___54321098765432109876543210987654
Bits14 dd 00100001000000001000000000000000b, 00000000000000000000100001000000b, 00000000000000000000000000000000b
;x 76543210987654321098765432109876___98765432109876543210987654321098
dd 00000000000000000000000000000000b, 00000000000000000000000000000000b
;x 76543210987654321098765432109876___98765432109876543210987654321098
dd 00000000000000000000000000000000b, 00000000000000000000000000000000b
;x 76543210987654321098765432109876___98765432109876543210987654321098
dd 00000000000000000000000000000000b, 00000000000000000000000000000000b
i guess you are a member of Euler ?
what i had in mind was to generate the data once and store it in a file
then - read the file into memory - lol
no need to generate the table every time you want to use it
Quote from: dedndave on June 27, 2013, 10:47:43 AM
i guess you are a member of Euler ?
Well, yes that also is true, but this one was a StackOverflow question (about parallelizing, I guess it was a student's homework) triggering my old enthusiasm.
I just started my old program for a 1<<32 Maximum and sat there waiting for marking of number "3", and waited and sat, and waited...
But from that point on the search for a highly efficient solution got its own game.
I think I am still in the range of calculating a GB not much slower than reading it, apart from the disc space saved. ;-)
But in the end it should be a streaming solution that writes as many GB as I want.
Once and For All. ;)
considering I didn't mess up the table:
align 16
every10th_at7:
;----------------- prolog:
db 00000001b,00000000b,01000000b,00010000b,00000100b,00000001b,00000000b,01000000b
db 00010000b,00000100b,00000001b,00000000b,01000000b,00010000b,00000100b,00000001b
;----------------- main loop starts:
db 00000000b,01000000b,00010000b,00000100b,00000001b,00000000b,01000000b,00010000b
db 00000100b,00000001b,00000000b,01000000b,00010000b,00000100b,00000001b,00000000b
db 01000000b,00010000b,00000100b,00000001b,00000000b,01000000b,00010000b,00000100b
db 00000001b,00000000b,01000000b,00010000b,00000100b,00000001b,00000000b,01000000b
db 00010000b,00000100b,00000001b,00000000b,01000000b,00010000b,00000100b,00000001b
; ---------------- unroll for a 16byte multiples writes:
db 00000000b,01000000b,00010000b,00000100b,00000001b,00000000b,01000000b,00010000b
db 00000100b,00000001b,00000000b,01000000b,00010000b,00000100b,00000001b,00000000b
db 01000000b,00010000b,00000100b,00000001b,00000000b,01000000b,00010000b,00000100b
db 00000001b,00000000b,01000000b,00010000b,00000100b,00000001b,00000000b,01000000b
db 00010000b,00000100b,00000001b,00000000b,01000000b,00010000b,00000100b,00000001b
mov rsi, _buff1
; rsi points to a buffer
mov rax, [every10th_at7]
mov rdx, [every10th_at7+8]
or [rsi], rax
or [rsi+8], rdx
add rsi, 16
mov rax, [every10th_at7+16]
mov rdx, [every10th_at7+24]
mov rdi, [every10th_at7+32]
mov rbx, [every10th_at7+40]
mov rbp, [every10th_at7+48]
; number of 40byte blocks
mov ecx, 10
@@:
or [rsi], rax
or [rsi+8], rdx
or [rsi+16], rdi
or [rsi+24], rbx
or [rsi+32], rbp
add rsi, 40
dec ecx
jnz @b
solution 2, illustrates a limitation of "x86 32-bit mode" and "POR"
mov esi, _buff1
; esi points to 16 byte aligned buffer
mov eax, [every10th_at7]
mov edx, [every10th_at7+4]
mov ecx, [every10th_at7+8]
mov edi, [every10th_at7+12]
or [esi], eax
or [esi+4], edx
or [esi+8], ecx
or [esi+12], edi
add esi, 16
movdqa xmm0, [every10th_at7+16]
movdqa xmm1, [every10th_at7+32]
movdqa xmm2, [every10th_at7+48]
movdqa xmm3, [every10th_at7+64]
mov eax, [every10th_at7+80]
mov edx, [every10th_at7+84]
mov ebx, [every10th_at7+88]
mov edi, [every10th_at7+92]
; number of 80 byte blocks
mov ecx, 5
@@:
movdqa xmm4, xmm0
movdqa xmm5, xmm1
movdqa xmm6, xmm2
movdqa xmm7, xmm3
por xmm4, [esi]
por xmm5, [esi+16]
por xmm6, [esi+32]
por xmm7, [esi+48]
or [esi+64], eax
or [esi+68], edx
or [esi+72], ebx
or [esi+76], edi
movdqa [esi], xmm4
movdqa [esi+16], xmm5
movdqa [esi+32], xmm6
movdqa [esi+48], xmm7
add esi, 80
dec ecx
jnz @b
EDIT, this code had incorrect offsets
mov rax, [every10th_at7]
mov rdx, [every10th_at7+8]
mov rdi, [every10th_at7+16]
mov rbx, [every10th_at7+24]
mov rbp, [every10th_at7+32]