The fastest way to fill a dword array with string values

flaschenpost · June 22, 2013, 10:58:12 AM

Hi, sorry for bumping in - and being late.

I stumbled over that wonderful thread because I was unhappy with the timing of a simple prime number generator. No math 8) or pun intended, the slow part can be simplified for the show as:

* to mark every 10th bit starting with bits 7 and 10 in a GB of memory. (7,10,17,20,27,30,...)
* also to mark every 14th bit starting with bits 15 and 24 (15,24, 29, 38, ...)

That classical C-Code ( bytelist[j/64] |= 1UL<<(j%64) ) seems to take quite some time, I come around 0.04s on a Intel(R) Core(TM)2 Duo CPU T5750 @ 2.00GHz. Maybe that is optimized, and I am hunting a pink unicorn.

I do not know asm yet, I do not yet use masm. All I have is that feeling that for setting every 10th bit should be a smarter way than those presented by gcc or clang.

It's just for fun. I even tried parallelisation on the Cuda side of life (in C), but I seem to make all of those terrible mistakes of the unblessed beginners - it gets much slower than the same partitioned code on plain CPU. One pitfall is parallel writes to the same word on different bits from different threads - it overwrites other set bits (setting the 20th bit needs loading the word, setting the bit, saving the word - so the recently stored bit 15 gets cleared again)

I think SSE3 or SSSE3 should make it easy, and even IntelHD on the other computer (Core i5) should have enough cpu hardware units to set a few bits in parallel.

dedndave · June 22, 2013, 12:02:40 PM

i would try to get away from multiple threads

there may be a faster way in SSE - i am not familiar with SSE shifts and rotates
but, you might think of something along this line

Code Select

    mov     eax,ecx  ;every 10th bit from ECX
    mov     edx,ebp  ;every 10th bit from EBP
    or      eax,esi  ;every 14th bit from ESI
    or      edx,ebx  ;every 14th bit from EBX
    mov     [edi],eax
    mov     [edi+4],edx
;do rotates here
    add     edi,8
;loop to top

qWord · June 22, 2013, 12:04:56 PM

If you can determine the period of the patterns (related to 128 bit blocks), it should be really simple to OR combine the pre-calculated patterns with the target bytes (e.g. using the SSE2 instr. POR).

dedndave · June 22, 2013, 12:06:48 PM

oh yah - a look-up table could have the patterned data

Antariy · June 22, 2013, 01:03:28 PM

Quote from: flaschenpost on June 22, 2013, 10:58:12 AM
It's just for fun. I even tried parallelisation on the Cuda side of life (in C), but I seem to make all of those terrible mistakes of the unblessed beginners - it gets much slower than the same partitioned code on plain CPU. One pitfall is parallel writes to the same word on different bits from different threads - it overwrites other set bits (setting the 20th bit needs loading the word, setting the bit, saving the word - so the recently stored bit 15 gets cleared again)

I think SSE3 or SSSE3 should make it easy, and even IntelHD on the other computer (Core i5) should have enough cpu hardware units to set a few bits in parallel.

Probably that's because Cuda (and SSE) are designed for really simple number crunching commands, not for the stuff like bits manipulation. I.e., they work well with float/probably integer datatypes, which can be processed well in parallel - image processing etc, but will probably slower than CPU in things like bits bits manipulation. Maybe the solution for the speed up on the CPU is to parallelise bits-in-bytes manipulation - i.e., prepare bytes of an array independedly, step by step and these steps should be intermixed, like "work with bit 1 of byte 1, work with bit 1 of byte 2, work with bit 1 of byte 3" then "work with bit 2 of byte 1, work with bit 2 of byte 2, work with bit 2 of byte 3" and so on, then put the combined bytes into array.

(There was is just an assumption about CUDA, did not work with it at all really).

FORTRANS · June 22, 2013, 10:07:12 PM

Quote from: flaschenpost on June 22, 2013, 10:58:12 AM
One pitfall is parallel writes to the same word on different bits from different threads - it overwrites other set bits (setting the 20th bit needs loading the word, setting the bit, saving the word - so the recently stored bit 15 gets cleared again)

Hi,

Well in assembly you need not load, modify, and store. You can
modify the memory directly. And that would avoid resetting any
set bits.

Code Select


        MOV     EAX,[Memory]    ; Load
        OR      EAX,BitPattern  ; Modify
        MOV     [Memory],EAX    ; Store

; can become
        OR      DWORD PTR [Memory],BitPattern

Regards,

Steve N.

jj2007 · June 23, 2013, 02:30:44 AM

Quote from: flaschenpost on June 22, 2013, 10:58:12 AM
* to mark every 10th bit starting with bits 7 and 10 in a GB of memory. (7,10,17,20,27,30,...)
* also to mark every 14th bit starting with bits 15 and 24 (15,24, 29, 38, ...

As qWord already mentioned, find the right periods and go ahead. For "every 10th bit", 80 bytes is the period - 5*16, or 2*32+1*16. Every 14th bit is trickier, of course - 8*14=?

dedndave · June 23, 2013, 03:45:58 AM

i'm not sure i understand the patterns - lol

7, 17, 29 are primes
10, 15, 20, 24, 27, 30, 38 are not

flaschenpost · June 23, 2013, 06:34:42 AM

Thanks for all the suggestions, I also thought about using the lookup-tables. They sound great for 2*5 and 2*7, but I am suspicious about the bytewise operation (is there some 8-bit logik still around?), so it might get to 2*5*64 bit and 2*7*64 bit for the first two numbers, and then up it goes (2*11*64, 2*13*64). At least in my C - proof of concept they are much slower than the (gcc-optimized) naive solutions.

@dedndave: well, the trivial non-primes I just won't store, so I don't care about even numbers or multiples of 3. The list simply gets

5 7 11 13 17 19 23 25 29 31 35 37 ...
0 1 2 3 4 5 6 7 8 9 10 11 ...

The marking of all numbers devidible by 5 is marking every 10th bit starting from 7 (which is 25) and from 10 (which is 35). You see, 25 + 5 makes 30 and is even, so no need to check that. 35 is marked with bit 10, 55 with bit 17.

But maybe I am really near the possible optimum. Coming from all the high-level-languages I "thought" there would be some magic working with bit-Stream-like sources and combining them in big chunks with "OR".

I definitely will play with the SSE-Registers.

jj2007 · June 23, 2013, 06:51:25 AM

Not sure if SSE will bring the big improvement, but...

You have two overlaid loops to handle, step 10 and step 14 bits. For speed, you would need to use (many) registers to set the bits and minimise mem-to-mem operations. It may look roughly like this in the end:

Entry:
xor eax, eax ; same as mov eax, 0
xor ebx, ebx
xor ecx, ecx
...
or reg32, whatever ; set bits for step 10
...
or reg32, whatever ; set bits for step 14
...
mov [edi], eax ; write regs to mem
mov [edi+4], ebx
mov [edi+8], ecx
mov [edi+12], edx

Now it would be great if you could post the fast gcc solution, with an __asm int 3; on entry to the innermost loop, so that we could check at assembly level what makes it fast. As a rule of thumb, hand-crafted assembly can be at least a factor 2 faster.

flaschenpost · June 23, 2013, 10:13:28 PM

@jj2007: here the code, I tried gcc with option -masm=intel.

Unfortunately __asm int 3; does not work with my linux-gcc. I inserted it manually at line 82, function doWork, before loop .L11 - but forgot after recompile ::)

dedndave · June 23, 2013, 10:42:55 PM

SSE may not help much, because memory accesses are limited by hardware
but - it may help a little, because you can write 128 bits at a whack

the idea would be to make a pattern table for one set and a pattern table for the other set
then, use SSE code to OR the appropriate pattern values together, then store them

even so.....
i think you said your current code takes 40 mS to write a GB
that's not too bad, really :P

jj2007 · June 24, 2013, 03:39:58 AM

Quote from: flaschenpost on June 23, 2013, 10:13:28 PM
@jj2007: here the code, I tried gcc with option -masm=intel.

I don't have gcc. Can you post the executable, please?

flaschenpost · June 24, 2013, 05:03:16 AM

There is a tn.s inside the zip. I hoped that would be compatible, but jwasm does not accept it.

Does anybody know how to compile C - Code into assembly for masm? The relevant part looks like

doWork:
.LFB64:
   .cfi_startproc
   movq   %rcx, %rax
   shrq   %rax
   cmpq   %rax, %rdx
   ja   .L10
   addq   %r8, %r8
   movl   $1, %r10d
   .p2align 4,,10
   .p2align 3
.L11:
   movq   %rsi, %r9
   movl   %esi, %ecx
   movq   %r10, %r11
   shrq   $6, %r9
   salq   %cl, %r11
   movl   %edx, %ecx
   orq   %r11, (%rdi,%r9,8)
   movq   %rdx, %r9
   movq   %r10, %r11
   shrq   $6, %r9
   addq   %r8, %rdx
   salq   %cl, %r11
   addq   %r8, %rsi
   orq   %r11, (%rdi,%r9,8)
   cmpq   %rax, %rdx
   jbe   .L11
.L10:
   cmpq   %rax, %rsi
   jae   .L9
   movq   %rsi, %rdx
   movl   $1, %eax
   movl   %esi, %ecx
   shrq   $6, %rdx
   salq   %cl, %rax
   orq   %rax, (%rdi,%rdx,8)
.L9:
   rep
   ret
   .cfi_endproc

I try to find a free compiler for win and come back then.

MichaelW · June 24, 2013, 09:46:49 AM

Using a gcc version 4.7.2 from a recent MinGW installation, Intel-syntax assembly is no problem, for inline assembly or for the assembly output.

Code Select


#include <stdio.h>
#include <conio.h>

double CalcArea(const double r)
{
    return r*r*3.14159;
}

int main (void)
{
    printf("%f\n", CalcArea(1));
    getch();
    __asm__("int 3\n");
}

Code Select


set file="test"
gcc -pedantic -S -masm=intel %file%.c -o %file%.asm 
pause
gcc -pedantic -masm=intel %file%.c -o %file%.exe
pause

Code Select


	.file	"test.c"
	.intel_syntax noprefix
	.text
	.globl	_CalcArea
	.def	_CalcArea;	.scl	2;	.type	32;	.endef
_CalcArea:
LFB6:
	.cfi_startproc
	push	ebp
	.cfi_def_cfa_offset 8
	.cfi_offset 5, -8
	mov	ebp, esp
	.cfi_def_cfa_register 5
	sub	esp, 8
	mov	eax, DWORD PTR [ebp+8]
	mov	DWORD PTR [ebp-8], eax
	mov	eax, DWORD PTR [ebp+12]
	mov	DWORD PTR [ebp-4], eax
	fld	QWORD PTR [ebp-8]
	fmul	QWORD PTR [ebp-8]
	fld	QWORD PTR LC0
	fmulp	st(1), st
	leave
	.cfi_restore 5
	.cfi_def_cfa 4, 4
	ret
	.cfi_endproc
LFE6:
	.def	___main;	.scl	2;	.type	32;	.endef
	.section .rdata,"dr"
LC3:
	.ascii "%f\12\0"
	.text
	.globl	_main
	.def	_main;	.scl	2;	.type	32;	.endef
_main:
LFB7:
	.cfi_startproc
	push	ebp
	.cfi_def_cfa_offset 8
	.cfi_offset 5, -8
	mov	ebp, esp
	.cfi_def_cfa_register 5
	and	esp, -16
	sub	esp, 16
	call	___main
	mov	eax, 0
	mov	edx, 1072693248
	mov	DWORD PTR [esp], eax
	mov	DWORD PTR [esp+4], edx
	call	_CalcArea
	fstp	QWORD PTR [esp+4]
	mov	DWORD PTR [esp], OFFSET FLAT:LC3
	call	_printf
	call	_getch
/APP
 # 13 "test.c" 1
	int 3

 # 0 "" 2
/NO_APP
	leave
	.cfi_restore 5
	.cfi_def_cfa 4, 4
	ret
	.cfi_endproc
LFE7:
	.section .rdata,"dr"
	.align 8
LC0:
	.long	-266631570
	.long	1074340345
	.def	_printf;	.scl	2;	.type	32;	.endef
	.def	_getch;	.scl	2;	.type	32;	.endef

Edit:

Somehow I missed your attachment. I added code to time the doWork procedure (Windows code that calls GetTickCount, sorry no time to do better) . The attachment contains the assembly output and EXEs for no optimization, -O1, -O2, -O3, and -Ofast. On both of my systems -O1 was the fastest.

The MASM Forum

News:

The fastest way to fill a dword array with string values

flaschenpost

dedndave

qWord

dedndave

Antariy

FORTRANS

jj2007

dedndave

flaschenpost

jj2007

flaschenpost

dedndave

jj2007

flaschenpost

MichaelW