The fastest way to fill a dword array with string values

Started by frktons, December 09, 2012, 02:49:23 AM


flaschenpost

I have to admit that I am not able to produce masm / jwasm compatible assembly from my C code; I did not know how many different text formats one could invent for the same simple opcodes :( . I thought "Intel format" would do the job. I use the Intel C compiler icc to optimize, and I get a factor of 2 from converting to a lookup table _and_ converting the while loop to a for loop, which surprised me.

Since I cannot find an assembler for the output of the Intel C compiler icc, I cannot produce DOS executables.

I begin to understand why a forum like this needs to be specific about OS and Tools.

I attached the current state of c and asm files. tn is "naive", tl is "looped".

The main problem stays the same:

1. I can either produce simple loops, good for optimization and vectorization, but then I have to go through the memory more often

or

2. I can try to be memory-local (cache lines etc.), but then the loops get complicated.

From my understanding of processors it should be fastest to:

- fetch a cache line from memory, as big as possible, in one read

- write into that memory segment with many of those bit settings (5 and 7, i.e. every 10th and every 14th bit, were of course just the examples; the steps grow, up to 1e16)

- flush that memory segment back to main memory, i.e. tell the processor that the line will not be needed again in the near future.

Or, alternatively:

- fill the SSE-Registers with the pattern

- write directly from SSE with "or" into main memory, if that is possible (a rough sketch follows below).
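In C with SSE2 intrinsics, a rough sketch of that second idea could look like the following (just an illustration: the function name is made up, dst is assumed 16-byte aligned, bytes a multiple of 16, and a single 16-byte pattern is used even though the real 10/14-bit patterns do not repeat every 16 bytes):

#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>

/* OR a preloaded 16-byte pattern into a large buffer, 16 bytes at a time. */
static void or_pattern_sse2(unsigned char *dst, size_t bytes,
                            const unsigned char pattern[16])
{
    __m128i pat = _mm_loadu_si128((const __m128i *)pattern);
    for (size_t i = 0; i < bytes; i += 16) {
        __m128i v = _mm_load_si128((__m128i *)(dst + i));             /* read the destination   */
        _mm_store_si128((__m128i *)(dst + i), _mm_or_si128(v, pat));  /* OR and write it back   */
    }
}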

So some questions: Is it possible to rotate the 256 bits of an AVX register or the 128 bits of an SSSE3 register? Is it even possible to fetch all the SSE/AVX registers from one memory region in one step, and to write them back together? Can I bit-shift or bit-rotate through those registers in a small number of cycles?

If I profile my code with valgrind, then the main time is needed for memory access (surprise, surprise!). But if I try to reduce the loops over memory by processing packets of memory together, then I get slower than the naive solution.

Finding a solution with a small number of cycles seems to be almost as complex as today's processors themselves.

@JJ2007: I was slightly misunderstood - I did not want you to compile with gcc, but I tried to make my gcc produce a "good" assembly file (.s in the flasch0.zip). What toolstack do you use when you have a piece of C code to optimize? Or do you really write the whole software directly in assembly, with your own libs?

I have even blown the dust off my old Windows partition (XP) and tried VC++ Express, but did not get very far with it (some trouble with a precompiled header that does not work and that I cannot really get rid of).

The funny thing is: version tn.c is slower than tl.c in that pure form, but when I try to pack it (even only for p=5 and p=7) into the bigger loop, it makes the whole process slower instead of faster.

I am still frustrated that involving the 16 "machines" of my Nvidia CUDA card did not speed up the marking of a rhythm of bits. Probably I am making some big mistakes with the memory levels in CUDA.

I think using 4 threads for 4 parts of memory should use the 2 cores with 2 hyperthreads each and speed things up, but maybe memory read/write is the bottleneck and threading makes it all worse.
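A minimal sketch of that 4-thread idea in C with an OpenMP pragma (purely illustrative: the function and parameter names are made up, the word count is assumed to be divisible by 4, and it needs -fopenmp to actually run in parallel):

#include <stdint.h>
#include <stddef.h>

/* Mark every step-th bit starting at 'start', with each of 4 threads
 * working on its own quarter of the sieve, so no word is shared. */
static void mark_parallel(uint64_t *C, size_t words, size_t step, size_t start)
{
    #pragma omp parallel for num_threads(4)
    for (int t = 0; t < 4; t++) {
        size_t lo_bit = (size_t)t * (words / 4) * 64;
        size_t hi_bit = lo_bit + (words / 4) * 64;
        /* first bit >= lo_bit that keeps the phase of 'start' */
        size_t j = (lo_bit <= start) ? start
                 : start + ((lo_bit - start + step - 1) / step) * step;
        for (; j < hi_bit; j += step)
            C[j >> 6] |= (uint64_t)1 << (j & 63);
    }
}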

Many thanks for the suggestions and discussions so far, and I hope that the question will bring some nice thoughts from the playful experts here!

If any questions arise, I will try to answer them...

hool

is this 128bit sequence correct? every 10th bit starting at bit 7?
0'0000001'0000000001'0000000001'0000000001'0000000001'0000000001'0000000001'0000000001'0000000001'0000000001'0000000001'0000000001'0000000001

byte0=1
byte1=0
byte2=64
byte3=16
byte4=4
byte5=1

jj2007

Quote from: flaschenpost on June 25, 2013, 06:20:29 AM
@JJ2007: I was slightly misunderstood - I did not want you to compile with gcc, but I tried to make my gcc produce a "good" assembly file (.s in the flasch0.zip). What toolstack do you use when you have a piece of C code to optimize? Or do you really write the whole software directly in assembly, with your own libs?

My mistake - I should have looked more carefully at the .s file...
This seems to be the innermost loop:
.L11:
   mov   r9, rsi
   mov   ecx, esi
   mov   r11, r10
   shr   r9, 6
   sal   r11, cl
   mov   ecx, edx
   or   QWORD PTR [rdi+r9*8], r11
   mov   r9, rdx
   mov   r11, r10
   shr   r9, 6
   add   rdx, r8
   sal   r11, cl
   add   rsi, r8
   or   QWORD PTR [rdi+r9*8], r11
   cmp   rdx, rax
   jbe   .L11


That looks slow, honestly. I am not so proficient in 64-bit assembly, but in general we try to avoid shifts because they are slow, apart from other design problems in that loop.

Re toolstack: Masm32 is a library of very fast routines compared to the standard CRT; MasmBasic is another library with often much faster algos, and for specific tasks you will find many more in the "Laboratory" section. And yes, we do write every algo from scratch, and we all find it particularly amusing to beat a CRT algo by a factor of 3 or more :icon_mrgreen:

flaschenpost

Quote from: jj2007 on June 25, 2013, 06:59:23 AM

That looks slow, honestly. I am not so proficient in 64-bit assembly, but in general we try to avoid shifts because they are slow, apart from other design problems in that loop.

Well, for setting a certain bit in a byte I can't see an alternative. Judging by http://gmplib.org/~tege/x86-timing.pdf shifts did not seem too bad, but they do not measure quad-word operations.

Do you see a better translation of:
   
    while(k <= limit/2){
        C[j / SW] |= ((ST)1 << (j % SW));   /* SW = word width (64), ST = 64-bit word type */
        C[k / SW] |= ((ST)1 << (k % SW));
        j += 2*p;
        k += 2*p;
    }
    if(j < limit/2){
        C[j / SW] |= ((ST)1 << (j % SW));
    }


Since the values in the lookup table are only right-shifted values of a starting entry, I even thought about replacing the lookup table (a higher value of p gives big tables) with shift operations. But you sound like a lookup in a local long[120] is faster than shifting the same value.
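Just to spell out the generic tradeoff being discussed (this is not the long[120] pattern table from the code above, only an illustrative 64-entry mask table):

#include <stdint.h>

static uint64_t mask_tab[64];                 /* hypothetical, filled once at startup */

static void init_mask_tab(void)
{
    for (int r = 0; r < 64; r++)
        mask_tab[r] = (uint64_t)1 << r;
}

static void set_bit_shift(uint64_t *C, uint64_t j)
{
    C[j >> 6] |= (uint64_t)1 << (j & 63);     /* computed shift, as in the .s listing */
}

static void set_bit_table(uint64_t *C, uint64_t j)
{
    C[j >> 6] |= mask_tab[j & 63];            /* precomputed mask, lookup instead of shift */
}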

Quote from: jj2007 on June 25, 2013, 06:59:23 AM
Re toolstack: Masm32 is a library of very fast routines compared to the standard CRT; MasmBasic is another library with often much faster algos, and for specific tasks you will find many more in the "Laboratory" section. And yes, we do write every algo from scratch, and we all find it particularly amusing to beat a CRT algo by a factor of 3 or more :icon_mrgreen:

Wow, sounds like a nice game. :)

I think an optimal prime number sieve would need techniques like those in BLAS/ATLAS: they measure some characteristics of the processor and system and then create an optimizer profile for just that system, maximizing cache usage, parallelization etc. What impressed me was that in the early 2000s there were really expensive BLAS libs from vendors like Sun and IBM, hand-optimized for their systems, but ATLAS as a general package got right into the same range of speed.

But nobody really needs a fast prime sieve, so it is not worth that amount of work. ;-)

The strange thing about your "shift is slow" is that replacing a lot of shifts with the predefined pattern gives "only" a factor of 2.

The (short) preparation in the fastest compiled version right now is

# Begin ASM
# prepare patterns
# End ASM
                                # LOE rbx rbp r12 r13 r14 r15
..B2.16:                        # Preds ..B2.2
        movq      %r14, %rax                                    #39.21
        shlq      $6, %rax                                      #39.21
        addq      $64, %rax                                     #39.21
        cmpq      %rax, %rbp                                    #39.21
        jae       ..B2.6        # Prob 10%                      #39.21
                                # LOE rax rbx rbp r12 r13 r14 r15
..B2.4:                         # Preds ..B2.16 ..B2.4
        movq      %r12, %rsi                                    #40.15
        movq      %rbp, %r8                                     #43.15
        shrq      $6, %rsi                                      #40.15
        movl      %r12d, %ecx                                   #40.37
        shrq      $6, %r8                                       #43.15
        movl      $1, %edx                                      #40.37
        shlq      %cl, %rdx                                     #40.37
        movl      %ebp, %ecx                                    #43.37
        movl      $1, %edi                                      #43.37
        lea       (%rbp,%r14,2), %rbp                           #45.9
        orq       %rdx, (%r15,%rsi,8)                           #40.9
        lea       (%r12,%r14,2), %r12                           #42.9
        orq       %rdx, (%rsp,%rsi,8)                           #41.9
        shlq      %cl, %rdi                                     #43.37
        orq       %rdi, (%r15,%r8,8)                            #43.9
        orq       %rdi, (%rsp,%r8,8)                            #44.9
        cmpq      %rax, %rbp                                    #39.21
        jb        ..B2.4        # Prob 95%                      #39.21
                                # LOE rax rbx rbp r12 r13 r14 r15
..B2.6:                         # Preds ..B2.4 ..B2.16
        cmpq      %rax, %r12                                    #47.18
        jae       ..B2.8        # Prob 50%                      #47.18
                                # LOE rbx r12 r13 r14 r15
..B2.7:                         # Preds ..B2.6
        movq      %r12, %rdx                                    #49.15
        movl      %r12d, %ecx                                   #49.37
        shrq      $6, %rdx                                      #49.15
        movl      $1, %eax                                      #49.37
        shlq      %cl, %rax                                     #49.37
        orq       %rax, (%r15,%rdx,8)                           #49.9
        orq       %rax, (%rsp,%rdx,8)                           #50.9
                                # LOE rbx r13 r14 r15
..B2.8:                         # Preds ..B2.6 ..B2.7
        movq      (%rsp,%r14,8), %rax                           #53.17
        movq      %rax, (%rsp)                                  #53.5
        shrq      $7, %r13                                      #55.22
                                # LOE rbx r13 r14 r15
..B2.18:                        # Preds ..B2.8


and the real loop is simply a looped memcopy:

# Begin ASM
# main loop
# End ASM
                                # LOE rbx r13 r14 r15
..B2.17:                        # Preds ..B2.18
        movq      %r14, %rcx                                    #57.9
        cmpq      %r13, %r14                                    #57.22
        jae       ..B2.12       # Prob 10%                      #57.22
                                # LOE rcx rbx r13 r14 r15
..B2.10:                        # Preds ..B2.17 ..B2.10
        movq      %rcx, %rax                                    #59.30
        xorl      %edx, %edx                                    #59.30
        divq      %r14                                          #59.30
        movq      (%rsp,%rdx,8), %rbp                           #59.19
        orq       %rbp, (%r15,%rcx,8)                           #59.9
        addq      $2, %rcx                                      #60.9
        cmpq      %r13, %rcx                                    #57.22
        jb        ..B2.10       # Prob 95%                      #57.22
                                # LOE rcx rbx r13 r14 r15
..B2.12:                        # Preds ..B2.10 ..B2.17
# Begin ASM
# end of main loop
# End ASM


For the higher primes (>64) this makes no sense, because in the naive solution not every DWORD would get a write access.

flaschenpost

Well, my recent version comes down to a memcpy-like operation that logically combines a memory range out in RAM with a small block from the stack, or something like that. So instead of

memcpy(C + offset, local, count)

I would need a

mem_or(C + offset, local, count)

to replace the long for-loop.

Is there something like that here? Did you ever speed up memcpy or memfill?
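For reference, a plain-C sketch of what such a mem_or could be (the name, signature and the unroll-by-four are only illustrative, and count is assumed to be a multiple of 4 64-bit words):

#include <stdint.h>
#include <stddef.h>

/* OR 'count' 64-bit words of 'local' into 'dst', unrolled by four. */
static void mem_or(uint64_t *dst, const uint64_t *local, size_t count)
{
    for (size_t i = 0; i < count; i += 4) {
        dst[i]     |= local[i];
        dst[i + 1] |= local[i + 1];
        dst[i + 2] |= local[i + 2];
        dst[i + 3] |= local[i + 3];
    }
}

It would then be called as mem_or(C + offset, local, count) in place of the long for-loop.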

AVX is out of the race, it can't bit-shift. Otherwise it would have been _really_ great.

Even AVX2 seems not to be able to bit-rotate the whole 256 bits - what a pity for me... If that had been built in, I could store the unshifted bit patterns for 15 primes in the 256-bit YMM0 .. YMM14, rotate each register to the right place and "OR" it into YMM15. That would mean 256 bits in a few clock cycles for the first 15 primes, plus the shift and modulo logic around it for 15 tinyints.

But extracting the register, shifting it dword by dword while carrying the bits across, restoring it and _then_ using the 256-bit OR seems really silly.
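To put the cost in perspective: even in plain scalar C, a whole-register rotate already needs a chain of shifts and ORs per word. A sketch for 256 bits held in four uint64_t words (w[0] least significant; only rotate counts of 1..63 handled):

#include <stdint.h>

static void rotl256(uint64_t w[4], unsigned n)   /* assumes 0 < n < 64 */
{
    uint64_t carry = w[3] >> (64 - n);           /* bits rotated around from the top word */
    for (int i = 0; i < 4; i++) {
        uint64_t next = w[i] >> (64 - n);        /* bits carried into the next word */
        w[i] = (w[i] << n) | carry;
        carry = next;
    }
}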


dedndave

SSE should be the way to go, i would think

one problem is the way you start out
Quote* to mark every 10th bit starting with bits 7 and 10 in a GB of memory. (7,10,17,20,27,30,...)
* also to mark every 14th bit starting with bits 15 and 24 (15,24, 29, 38, ...)

let's make it fast and say
Quote* to mark every 10th bit starting with bits 7 and 0 in a GB of memory. (7,0,17,10,27,20,...)
* also to mark every 14th bit starting with bits 1 and 10 (1,10,15,24, 29, 38, ...)
* at the end, go back and correct the first few bytes :)

this is the OWORD bit pattern for the first line of description, shown in little-endian form
10000001 00100000 01001000 00010010 00000100 10000001 00100000 01001000 00010010 00000100 10000001 00100000 01001000 00010010 00000100 10000001
00100000 01001000 00010010 00000100 10000001 00100000 01001000 00010010 00000100 10000001 00100000 01001000 00010010 00000100 10000001 00100000
01001000 00010010 00000100 10000001 00100000 01001000 00010010 00000100 10000001 00100000 01001000 00010010 00000100 10000001 00100000 01001000
00010010 00000100 10000001 00100000 01001000 00010010 00000100 10000001 00100000 01001000 00010010 00000100 10000001 00100000 01001000 00010010
00000100 10000001 00100000 01001000 00010010 00000100 10000001 00100000 01001000 00010010 00000100 10000001 00100000 01001000 00010010 00000100

the pattern repeats after the 5th OWORD

the pattern for the second line of description repeats after the 7th OWORD
make 2 tables, one with 5 OWORD's, one with 7 OWORD's
POR them together and write
adjust the 3 pointers (one for table 1, one for table 2, one for the destination)
rinse and repeat
when you're done, go back and straighten out the first few bytes

it would seem to me that the entire array is, then, a repeat of the first 35 OWORD's   :biggrin:
so - it could be even faster
just create the first 280 bytes, and copy it until you are happy - lol

a little faster, still....
create the first 280 bytes
copy it several times to make a larger block
copy the larger block several times until the gig is full
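in C, that whole scheme might boil down to something like the sketch below (assumptions: the usual LSB-first bit order within a byte, the original start bits 7/10 and 15/24, and a buffer size that is a multiple of 280 bytes - the names are made up):

#include <stdint.h>
#include <string.h>

enum { UNIT = 280 };   /* 2240 bits, divisible by both 10 and 14 */

/* build the 280-byte repeating unit once, then tile it over the big buffer
 * with doubling memcpy's; buf_bytes is assumed to be a multiple of UNIT    */
static void fill_with_pattern(uint8_t *buf, size_t buf_bytes)
{
    uint8_t unit[UNIT] = {0};
    for (size_t b = 7;  b < UNIT * 8; b += 10) unit[b >> 3] |= (uint8_t)(1u << (b & 7));
    for (size_t b = 10; b < UNIT * 8; b += 10) unit[b >> 3] |= (uint8_t)(1u << (b & 7));
    for (size_t b = 15; b < UNIT * 8; b += 14) unit[b >> 3] |= (uint8_t)(1u << (b & 7));
    for (size_t b = 24; b < UNIT * 8; b += 14) unit[b >> 3] |= (uint8_t)(1u << (b & 7));

    memcpy(buf, unit, UNIT);
    for (size_t filled = UNIT; filled < buf_bytes; ) {
        size_t chunk = (filled <= buf_bytes - filled) ? filled : buf_bytes - filled;
        memcpy(buf + filled, buf, chunk);        /* double the filled region each pass */
        filled += chunk;
    }
}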

flaschenpost

@dedndave: yeah, that is what I tried in C, but it made my code not faster, only more complicated (maybe because I have no processor and asm experience).  :(  Thanks anyway for strengthening my mood in that direction!

The critical part seems to be the "OR" of two memory blocks, since 5 and 7 were only the starting point, and when I stir the 11, 13 and 17 into the soup the blocks get really large (and since they are prime numbers, alignment can only be achieved by another factor of 64 in the list), so there has to be a fast way to "OR" a local memory block with the gig out there.

I read a quite strange thread about 64-bit and xmemcopy, so I will try to adopt ideas from there to OR blocks. I have already made a big improvement (in C) by unrolling the loop and doing 128 long-ORs in a row.

So still needed:

  • optimized code to memfill blocks of data (cache? stack?)
  • optimized code to set several bits throughout a big memory region (cache? registers? stack?)

I will try to learn enough assembler to write some working code!

jj2007

You basically need one read-only memory block that or's the two 10- and 14-bit sequences together (I have started the exercise below). You then combine them into a single block of 7*5*8 = 280 bytes.
272 of them can be accessed in 16-byte steps, then you need one 8-byte write. Or you choose a 560-byte read area to access all of them in 16-byte steps.

Speed then depends on how much of that read memory you can squeeze into registers - xmm or ymm. With 8 xmm regs you can cover 7*16 = 112 bytes; the rest must be read from memory with the remaining xmm. With ymm regs coverage gets better, but you still can't arrive at 560 bytes.

After that, it's simply or'ing the Gigabyte with a battery of registers. There is a MOVNTDQ (Move Double Quadword Non-Temporal) instruction that is worth studying, but afaik no similar instruction is available for or'ing memory.
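For a pure pattern fill (no or'ing with data already in memory), the non-temporal idea could look like this in C with SSE2 intrinsics - a sketch only, with dst assumed 16-byte aligned and bytes a multiple of 16:

#include <emmintrin.h>   /* SSE2: _mm_stream_si128 */
#include <stddef.h>

/* write a 16-byte pattern over a large buffer with non-temporal stores,
 * bypassing the cache; finish with a store fence */
static void stream_fill(unsigned char *dst, size_t bytes,
                        const unsigned char pattern[16])
{
    __m128i pat = _mm_loadu_si128((const __m128i *)pattern);
    for (size_t i = 0; i < bytes; i += 16)
        _mm_stream_si128((__m128i *)(dst + i), pat);
    _mm_sfence();
}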

HTH

.data
        ;x 10987654321098765432109876543210___32109876543210987654321098765432___54321098765432109876543210987654
Bits10        dd 01001000000100100000010010000000b, 00010010000001001000000100100000b, 00000100100000010010000001001000b

        ;x 76543210987654321098765432109876___98765432109876543210987654321098
        dd 10000001001000000100100000010010b, 00100000010010000001001000000100b

        ;x 10987654321098765432109876543210___32109876543210987654321098765432___54321098765432109876543210987654
Bits14        dd 00100001000000001000000000000000b, 00000000000000000000100001000000b, 00000000000000000000000000000000b

        ;x 76543210987654321098765432109876___98765432109876543210987654321098
        dd 00000000000000000000000000000000b, 00000000000000000000000000000000b

        ;x 76543210987654321098765432109876___98765432109876543210987654321098
        dd 00000000000000000000000000000000b, 00000000000000000000000000000000b

        ;x 76543210987654321098765432109876___98765432109876543210987654321098
        dd 00000000000000000000000000000000b, 00000000000000000000000000000000b

dedndave

i guess you are a member of Project Euler?

what i had in mind was to generate the data once and store it in a file
then - read the file into memory - lol

no need to generate the table every time you want to use it

flaschenpost

Quote from: dedndave on June 27, 2013, 10:47:43 AM
i guess you are a member of Euler ?

Well, yes, that is also true, but this one was a StackOverflow question (about parallelizing; I guess it was a student's homework) that triggered my old enthusiasm.

I just started my old program for a 1<<32 Maximum and sat there waiting for marking of number "3", and waited and sat, and waited...

But from that point on the search for a highly efficient solution got its own game.

I think I am still in the range of calculating a GB not much slower than reading it, apart from the disc space saved. ;-)

But in the end it should be a streaming solution that writes as many GB as I want.

Once and For All. ;)

hool

considering I didn't mess up the table:
        align 16
every10th_at7:
        ;-----------------  prolog:
        db 00000001b,00000000b,01000000b,00010000b,00000100b,00000001b,00000000b,01000000b
        db 00010000b,00000100b,00000001b,00000000b,01000000b,00010000b,00000100b,00000001b
        ;-----------------  main loop starts:
        db 00000000b,01000000b,00010000b,00000100b,00000001b,00000000b,01000000b,00010000b
        db 00000100b,00000001b,00000000b,01000000b,00010000b,00000100b,00000001b,00000000b
        db 01000000b,00010000b,00000100b,00000001b,00000000b,01000000b,00010000b,00000100b
        db 00000001b,00000000b,01000000b,00010000b,00000100b,00000001b,00000000b,01000000b
        db 00010000b,00000100b,00000001b,00000000b,01000000b,00010000b,00000100b,00000001b
        ; ----------------  unroll for 16-byte-multiple writes:
        db 00000000b,01000000b,00010000b,00000100b,00000001b,00000000b,01000000b,00010000b
        db 00000100b,00000001b,00000000b,01000000b,00010000b,00000100b,00000001b,00000000b
        db 01000000b,00010000b,00000100b,00000001b,00000000b,01000000b,00010000b,00000100b
        db 00000001b,00000000b,01000000b,00010000b,00000100b,00000001b,00000000b,01000000b
        db 00010000b,00000100b,00000001b,00000000b,01000000b,00010000b,00000100b,00000001b



        mov     rsi, _buff1

        ; rsi points to a buffer

        mov     rax, [every10th_at7]
        mov     rdx, [every10th_at7+8]
        or      [rsi], rax
        or      [rsi+8], rdx
        add     rsi, 16

        mov     rax, [every10th_at7+16]
        mov     rdx, [every10th_at7+24]
        mov     rdi, [every10th_at7+32]
        mov     rbx, [every10th_at7+40]
        mov     rbp, [every10th_at7+48]

        ; number of 40byte blocks
        mov     ecx, 10
@@:
        or     [rsi], rax
        or     [rsi+8], rdx
        or     [rsi+16], rdi
        or     [rsi+24], rbx
        or     [rsi+32], rbp
        add    rsi, 40
        dec    ecx
        jnz    @b


solution 2 illustrates a limitation of "x86 32-bit mode" and "POR"

        mov     esi, _buff1
        ; esi points to 16 byte aligned buffer

        mov     eax, [every10th_at7]
        mov     edx, [every10th_at7+4]
        mov     ecx, [every10th_at7+8]
        mov     edi, [every10th_at7+12]
        or      [esi], eax
        or      [esi+4], edx
        or      [esi+8], ecx
        or      [esi+12], edi
        add     esi, 16

        movdqa  xmm0, [every10th_at7+16]
        movdqa  xmm1, [every10th_at7+32]
        movdqa  xmm2, [every10th_at7+48]
        movdqa  xmm3, [every10th_at7+64]
        mov     eax, [every10th_at7+80]
        mov     edx, [every10th_at7+84]
        mov     ebx, [every10th_at7+88]
        mov     edi, [every10th_at7+92]


        ; number of 80 byte blocks
        mov     ecx, 5
@@:
        movdqa  xmm4, xmm0
        movdqa  xmm5, xmm1
        movdqa  xmm6, xmm2
        movdqa  xmm7, xmm3
        por     xmm4, [esi]
        por     xmm5, [esi+16]
        por     xmm6, [esi+32]
        por     xmm7, [esi+48]
        or      [esi+64], eax
        or      [esi+68], edx
        or      [esi+72], ebx
        or      [esi+76], edi
        movdqa  [esi], xmm4
        movdqa  [esi+16], xmm5
        movdqa  [esi+32], xmm6
        movdqa  [esi+48], xmm7
        add     esi, 80
        dec     ecx
        jnz     @b
                     


EDIT: this code had incorrect offsets

        mov     rax, [every10th_at7]
        mov     rdx, [every10th_at7+8]
        mov     rdi, [every10th_at7+16]
        mov     rbx, [every10th_at7+24]
        mov     rbp, [every10th_at7+32]