The fastest way to fill a dword array with string values

Started by frktons, December 09, 2012, 02:49:23 AM


flaschenpost

Hi, sorry for bumping in - and being late.

I stumbled over that wonderful thread because I was unhappy with the timing of a simple prime number generator. No math   8) or pun intended, the slow part can be simplified for the show as:

* to mark every 10th bit starting with bits 7 and 10 in a GB of memory. (7, 10, 17, 20, 27, 30, ...)
* also to mark every 14th bit starting with bits 15 and 24 (15, 24, 29, 38, ...)

That classical C code ( bytelist[j/64] |= 1UL<<(j%64) ) seems to take quite some time; I get around 0.04s on an Intel(R) Core(TM)2 Duo CPU T5750 @ 2.00GHz. Maybe that is already optimized, and I am hunting a pink unicorn.

I do not know asm yet, and I do not yet use masm. All I have is the feeling that there should be a smarter way to set every 10th bit than the ones gcc or clang produce.

It's just for fun. I even tried parallelisation on the Cuda side of life (in C), but I seem to make all of those terrible mistakes of the unblessed beginners - it gets much slower than the same partitioned code on plain CPU. One pitfall is parallel writes to the same word on different bits from different threads - it overwrites other set bits (setting the 20th bit needs loading the word, setting the bit, saving the word - so the recently stored bit 15 gets cleared again)

I think SSE3 or SSSE3 should make it easy, and even IntelHD on the other computer (Core i5) should have enough cpu hardware units to set a few bits in parallel.

dedndave

i would try to get away from multiple threads

there may be a faster way in SSE - i am not familiar with SSE shifts and rotates
but, you might think of something along this line

    mov     eax,ecx  ;every 10th bit from ECX
    mov     edx,ebp  ;every 10th bit from EBP
    or      eax,esi  ;every 14th bit from ESI
    or      edx,ebx  ;every 14th bit from EBX
    mov     [edi],eax
    mov     [edi+4],edx
;do rotates here
    add     edi,8
;loop to top

qWord

If you can determine the period of the patterns (related to 128 bit blocks), it should be really simple to OR combine the pre-calculated patterns with the target bytes (e.g. using the SSE2 instr. POR).
MREAL macros - when you need floating point arithmetic while assembling!

dedndave

oh yah - a look-up table could have the patterned data

Antariy

Quote from: flaschenpost on June 22, 2013, 10:58:12 AM
It's just for fun. I even tried parallelisation on the Cuda side of life (in C), but I seem to make all of those terrible mistakes of the unblessed beginners - it gets much slower than the same partitioned code on plain CPU. One pitfall is parallel writes to the same word on different bits from different threads - it overwrites other set bits (setting the 20th bit needs loading the word, setting the bit, saving the word - so the recently stored bit 15 gets cleared again)

I think SSE3 or SSSE3 should make it easy, and even IntelHD on the other computer (Core i5) should have enough cpu hardware units to set a few bits in parallel.

Probably that's because Cuda (and SSE) are designed for really simple number-crunching commands, not for stuff like bit manipulation. I.e., they work well with float (and probably integer) datatypes, which can be processed well in parallel - image processing etc. - but will probably be slower than the CPU at things like bit manipulation. Maybe the solution for the speed-up on the CPU is to parallelise the bits-in-bytes manipulation - i.e., prepare the bytes of an array independently, step by step, and intermix these steps, like "work with bit 1 of byte 1, work with bit 1 of byte 2, work with bit 1 of byte 3", then "work with bit 2 of byte 1, work with bit 2 of byte 2, work with bit 2 of byte 3" and so on, then put the combined bytes into the array.

(That was just an assumption about CUDA - I did not really work with it at all.)

FORTRANS

Quote from: flaschenpost on June 22, 2013, 10:58:12 AM
One pitfall is parallel writes to the same word on different bits from different threads - it overwrites other set bits (setting the 20th bit needs loading the word, setting the bit, saving the word - so the recently stored bit 15 gets cleared again)

Hi,

   Well in assembly you need not load, modify, and store.  You can
modify the memory directly.  And that would avoid resetting any
set bits.


        MOV     EAX,[Memory]    ; Load
        OR      EAX,BitPattern  ; Modify
        MOV     [Memory],EAX    ; Store

; can become
        OR      DWORD PTR [Memory],BitPattern


Regards,

Steve N.

jj2007

Quote from: flaschenpost on June 22, 2013, 10:58:12 AM
* to mark every 10th bit starting with bits 7 and 10 in a GB of memory. (7,10,17,20,27,30,...)
* also to mark every 14th bit starting with bits 15 and 24 (15, 24, 29, 38, ...)

As qWord already mentioned, find the right periods and go ahead. For "every 10th bit", 80 bytes is the period - 5*16, or 2*32+1*16. Every 14th bit is trickier, of course - 8*14=?

dedndave

i'm not sure i understand the patterns - lol

7, 17, 29 are primes
10, 15, 20, 24, 27, 30, 38 are not


flaschenpost

Thanks for all the suggestions, I also thought about using lookup tables. They sound great for 2*5 and 2*7, but I am suspicious about the bytewise operation (is there some 8-bit logic still around?), so it might grow to 2*5*64 bit and 2*7*64 bit for the first two numbers, and then up it goes (2*11*64, 2*13*64). At least in my C proof of concept they are much slower than the (gcc-optimized) naive solutions.

@dedndave: well, the trivial non-primes I just won't store, so I don't care about even numbers or multiples of 3. The list simply gets

5 7  11 13  17 19  23 25  29 31  35 37 ...
0 1   2  3   4  5   6  7   8  9  10 11 ...


The marking of all numbers divisible by 5 means marking every 10th bit starting from bit 7 (which is 25) and from bit 10 (which is 35). You see, 25 + 5 makes 30 and is even, so no need to check that. 35 is marked with bit 10, 55 with bit 17.

But maybe I am really near the possible optimum. Coming from all the high-level languages, I "thought" there would be some magic for working with bit-stream-like sources and combining them in big chunks with "OR".

I definitely will play with the SSE-Registers.

jj2007

Not sure if SSE will bring the big improvement, but...

You have two overlaid loops to handle, step 10 and step 14 bits. For speed, you would need to use (many) registers to set the bits and minimise mem-to-mem operations. It may look roughly like this in the end:

Entry:
  xor eax, eax  ; same as mov eax, 0
  xor ebx, ebx
  xor ecx, ecx
...
  or reg32, whatever  ; set bits for step 10
...
  or reg32, whatever  ; set bits for step 14
...
  mov [edi], eax  ; write regs to mem
  mov [edi+4], ebx
  mov [edi+8], ecx
  mov [edi+12], edx

Now it would be great if you could post the fast gcc solution, with an __asm int 3; on entry to the innermost loop, so that we could check at assembly level what makes it fast. As a rule of thumb, hand-crafted assembly can be at least a factor 2 faster.

flaschenpost

@jj2007: here is the code; I tried gcc with the option -masm=intel.

Unfortunately __asm int 3; does not work with my linux-gcc. I inserted it manually at line 82, function doWork, before loop .L11 - but forgot it after recompiling  ::)

dedndave

SSE may not help much, because memory accesses are limited by hardware
but - it may help a little, because you can write 128 bits at a whack

the idea would be to make a pattern table for one set and a pattern table for the other set
then, use SSE code to OR the appropriate pattern values together, then store them

even so.....
i think you said your current code takes 40 ms to write a GB
that's not too bad, really   :P

jj2007

Quote from: flaschenpost on June 23, 2013, 10:13:28 PM
@jj2007: here the code, I tried gcc with option -masm=intel.

I don't have gcc. Can you post the executable, please?

flaschenpost

There is a tn.s inside the zip. I hoped that would be compatible, but jwasm does not accept it.

Does anybody know how to compile C code into assembly for masm? The relevant part looks like

doWork:
.LFB64:
   .cfi_startproc
   movq   %rcx, %rax
   shrq   %rax
   cmpq   %rax, %rdx
   ja   .L10
   addq   %r8, %r8
   movl   $1, %r10d
   .p2align 4,,10
   .p2align 3
.L11:
   movq   %rsi, %r9
   movl   %esi, %ecx
   movq   %r10, %r11
   shrq   $6, %r9
   salq   %cl, %r11
   movl   %edx, %ecx
   orq   %r11, (%rdi,%r9,8)
   movq   %rdx, %r9
   movq   %r10, %r11
   shrq   $6, %r9
   addq   %r8, %rdx
   salq   %cl, %r11
   addq   %r8, %rsi
   orq   %r11, (%rdi,%r9,8)
   cmpq   %rax, %rdx
   jbe   .L11
.L10:
   cmpq   %rax, %rsi
   jae   .L9
   movq   %rsi, %rdx
   movl   $1, %eax
   movl   %esi, %ecx
   shrq   $6, %rdx
   salq   %cl, %rax
   orq   %rax, (%rdi,%rdx,8)
.L9:
   rep
   ret
   .cfi_endproc

I'll try to find a free compiler for Windows and come back then.

MichaelW

Using a gcc version 4.7.2 from a recent MinGW installation, Intel-syntax assembly is no problem, for inline assembly or for the assembly output.

#include <stdio.h>
#include <conio.h>

double CalcArea(const double r)
{
    return r*r*3.14159;
}

int main (void)
{
    printf("%f\n", CalcArea(1));
    getch();
    __asm__("int 3\n");
}


set file="test"
gcc -pedantic -S -masm=intel %file%.c -o %file%.asm
pause
gcc -pedantic -masm=intel %file%.c -o %file%.exe
pause


.file "test.c"
.intel_syntax noprefix
.text
.globl _CalcArea
.def _CalcArea; .scl 2; .type 32; .endef
_CalcArea:
LFB6:
.cfi_startproc
push ebp
.cfi_def_cfa_offset 8
.cfi_offset 5, -8
mov ebp, esp
.cfi_def_cfa_register 5
sub esp, 8
mov eax, DWORD PTR [ebp+8]
mov DWORD PTR [ebp-8], eax
mov eax, DWORD PTR [ebp+12]
mov DWORD PTR [ebp-4], eax
fld QWORD PTR [ebp-8]
fmul QWORD PTR [ebp-8]
fld QWORD PTR LC0
fmulp st(1), st
leave
.cfi_restore 5
.cfi_def_cfa 4, 4
ret
.cfi_endproc
LFE6:
.def ___main; .scl 2; .type 32; .endef
.section .rdata,"dr"
LC3:
.ascii "%f\12\0"
.text
.globl _main
.def _main; .scl 2; .type 32; .endef
_main:
LFB7:
.cfi_startproc
push ebp
.cfi_def_cfa_offset 8
.cfi_offset 5, -8
mov ebp, esp
.cfi_def_cfa_register 5
and esp, -16
sub esp, 16
call ___main
mov eax, 0
mov edx, 1072693248
mov DWORD PTR [esp], eax
mov DWORD PTR [esp+4], edx
call _CalcArea
fstp QWORD PTR [esp+4]
mov DWORD PTR [esp], OFFSET FLAT:LC3
call _printf
call _getch
/APP
# 13 "test.c" 1
int 3

# 0 "" 2
/NO_APP
leave
.cfi_restore 5
.cfi_def_cfa 4, 4
ret
.cfi_endproc
LFE7:
.section .rdata,"dr"
.align 8
LC0:
.long -266631570
.long 1074340345
.def _printf; .scl 2; .type 32; .endef
.def _getch; .scl 2; .type 32; .endef



Edit:

Somehow I missed your attachment. I added code to time the doWork procedure (Windows code that calls GetTickCount; sorry, no time to do better). The attachment contains the assembly output and EXEs for no optimization, -O1, -O2, -O3, and -Ofast. On both of my systems -O1 was the fastest.

Well Microsoft, here's another nice mess you've gotten us into.