Multi thread rdrand test piece

Started by hutch--, June 12, 2019, 05:17:56 AM


hutch--

This test piece was designed to run 4 threads with the same code to create 4 buffers filled with rdrand random numbers that are saved to disk sequentially to make up the total size. Compared with an earlier test piece using the same loop code, the 4 thread version clocks about 4 times faster, which is surprising in that you usually don't get 4 times the speed with 4 threads. Results are in the good quality range and the test piece automatically runs ENT to see if the results are any good.

With "rdrand" there is some lingering suspicion that security agencies may have forced Intel to put a back door in the hardware, so using this technique for security would need another random algo to be combined with the rdrand version. The current style of output is good quality random though, and if you have need of random pads other than for security, this will work well.

Note that the file IO is very ordinary but the random generation is reasonably fast.

LiaoMi

Hi Hutch,

my results...

Creating pad
Thread 1 here
Thread 2 here
Thread 3 here
Thread 4 here
Writing pad to disk
File written to disk at 1024 megabytes
Analysing output
Wait for the result
Entropy = 8.000000 bits per byte.

Optimum compression would reduce the size
of this 1073741824 byte file by 0 percent.

Chi square distribution for 1073741824 samples is 297.57, and randomly
would exceed this value 5.00 percent of the times.

Arithmetic mean value of data bytes is 127.5028 (127.5 = random).
Monte Carlo value for Pi is 3.141555604 (error 0.00 percent).
Serial correlation coefficient is -0.000020 (totally uncorrelated = 0.0).

aw27

This is a variation with 64 threads, using WaitForMultipleObjects.
Incidentally, I could not make conout print properly from the threads (I did not find the reason), so I ended up using vc_printf.

hutch--

That worked well Jose. I added timing to it and it is about 30 - 35% faster than the 4 thread version. When I get a bit further along I will play with a higher thread count; the example was aimed at 4 cores as that is what most people are running at the moment.

hutch--

This is the next version, primarily testing the increased thread count and how it affects timing. I have stuck with the spinlock to avoid the 64 thread limit (WaitForMultipleObjects accepts at most 64 handles per call) and have tested 128 and 256 threads, but it does not get faster than the 64 thread count. Testing over the thread count range, 4 threads runs at about the speed of my earlier test piece, 8 is a lot faster and you get incremental gains up to 64 threads.

Older hardware may not be happy with high thread counts but should handle 16 or 32 threads.


; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    include \masm32\include64\masm64rt.inc

    .data
      flag dq ?                                         ; completion flag
      bsiz dq ?                                         ; buffer size
      blsz dq ?                                         ; block size

    .code

    tcnt equ <64>   ; <8> <16> <32> <64> <128> <256>    ; thread count

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

entry_point proc

    USING r12,r13,r14,r15
    LOCAL hFile :QWORD
    LOCAL bwrt  :QWORD

    SaveRegs                                            ; save non-volatile registers

    mov flag, 0                                         ; zero the completion flag
    mov bsiz, 1024*1024*1024                            ; allocated size
    mov blsz, 1024*1024*1024/tcnt                       ; individual block size

    conout "Creating random pad",lf

    mov r15, alloc(bsiz)                                ; allocate single block

    mov r14, rvcall(GetTickCount)                       ; start the timing

    mov r12, tcnt                                       ; tcnt thread count
    mov r13, r15
    @@:
    invoke CreateThread,0,0,ptr$(Thread),r13,0,0
    rcall CloseHandle, rax
    add r13, blsz                                       ; set next write location
    sub r12, 1
    jnz @B

    spinlock:                                           ; poll for completion count
    pause                                               ; spin-wait hint, eases the polling load
    cmp flag, tcnt
    jne spinlock

    rcall GetTickCount
    sub rax, r14
    conout "Timing = ",str$(rax)," ms",lf               ; display timing results

    conout "Thread Completion Count = ",str$(flag),lf
    conout "Saving file to disk",lf
    mov bwrt, savefile("test.pad",r15,bsiz)             ; write result to disk

    exec "ent test.pad"

    waitkey "Wait for ENT to complete",lf
    RestoreRegs                                         ; restore non-volatile registers
    .exit

entry_point endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

NOSTACKFRAME

Thread proc

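    mov rdx, blsz                                       ; load the block size

  fill:
    rdrand rax                                          ; CF = 1 when a random value was returned
    jnc fill                                            ; retry if the generator was not ready
    mov [rcx], rax                                      ; rcx is the buffer address
    add rcx, 8
    sub rdx, 8
    jnz fill

    lock add flag, 1                                    ; atomically increment the completion flag
    ret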

Thread endp

STACKFRAME

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    end

aw27

Probably, in this particular scenario having more threads than the number of physical cores (not logical cores) will not bring any real advantage.
For most people it is 4 or 6 physical cores, up to 32 for people with AMD Threadripper.

However, this is not always the case, in particular when threads stall waiting for a resource.


aw27

I want to clarify my previous statement, which apparently contradicts Hutch's findings. It does not.

When I said that a number of threads equal to the number of physical cores is probably the best we can do in this scenario, there is a question I did not answer. On a CPU with 4 physical cores/8 logical threads, when I launch a program with 4 threads, how can I guarantee that each of them goes to a different physical core? I cannot; some physical core(s) may receive more than 1 thread. Some people have mentioned that in Windows, logical and physical cores are interleaved. If this is true, we can play with affinity masks to send each thread to a different physical core. This might be an interesting exercise.

In my own experiments with Hutch's last exercise (not using complications like affinity masks, etc.) the best result is achieved with 11 threads, because I have 6 physical cores and the main thread is extremely busy polling. If we don't do polling and use WaitForMultipleObjects instead, the best result is achieved with 12 threads.
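
A rough sketch of the affinity-mask exercise described above, for illustration only; it is not code posted in this thread and has not been timed. It assumes a CPU with 4 physical cores and 8 logical processors, and it assumes (unverified, as noted above) that the two logical processors of each core are numbered next to each other, so the masks 1, 4, 16 and 64 would land on different physical cores. The names Worker and done are made up for the example, and the worker does nothing except report completion. Each thread is created suspended (flag value 4, CREATE_SUSPENDED) so the mask is in place before it starts running.

```asm
    include \masm32\include64\masm64rt.inc

    .data
      done dq 0                                 ; completion count (hypothetical name)

    .code

    tcnt equ <4>                                ; assumption: one thread per physical core

entry_point proc

    USING r12,r13,r14

    SaveRegs

    mov r12, tcnt                               ; threads still to create
    mov r13, 1                                  ; affinity mask, starts at logical processor 0

  mkthread:
    invoke CreateThread,0,0,ptr$(Worker),0,4,0  ; 4 = CREATE_SUSPENDED
    mov r14, rax                                ; keep the thread handle
    invoke SetThreadAffinityMask,r14,r13        ; pin the thread to one logical processor
    rcall ResumeThread,r14                      ; let it run now that the mask is set
    rcall CloseHandle,r14
    shl r13, 2                                  ; skip the assumed sibling logical processor
    sub r12, 1
    jnz mkthread

  waitall:                                      ; poll for completion count
    pause
    cmp done, tcnt
    jne waitall

    conout "All pinned threads finished",lf

    waitkey
    RestoreRegs
    .exit

entry_point endp

NOSTACKFRAME

Worker proc

    lock add done, 1                            ; atomic completion count
    xor eax, eax                                ; thread exit code
    ret

Worker endp

STACKFRAME

    end
```

If SetThreadAffinityMask is given a mask for a logical processor that does not exist it returns zero and the thread stays unpinned, so on a machine with a different core layout the shift amount and thread count would need adjusting.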

hutch--

I don't see any problem in your comment, there are variables in play that are not well documented with processor hardware. It was not all that long ago that each thread would add about 90% to the first thread, but the first test I did with 4 threads was 4 times faster than a single thread and I suggest it is because the hardware is getting smarter and more efficient. I also don't know what the fetch mechanism is with "rdrand" in relation to the preferred instruction set, and while the reference material says the result is derived from thermal noise, it is a more complex instruction than ADD or MOV.

If rdrand is laggy as I suspect, then it means that the thread is not being fully utilised, which explains why you can run so many threads. It is not uncommon with socket communication to have under-utilised threads, which allows you to run many more threads without processor saturation. Sad to say you learn these things by test piece as the documentation does not really help you here.

daydreamer

Interesting, maybe I am going to try rdrand vs a random generator snippet.
Doesn't it require some AVX code to fully utilize a thread/core?



my none asm creations
https://masm32.com/board/index.php?topic=6937.msg74303#msg74303
I am an Invoker
"An Invoker is a mage who specializes in the manipulation of raw and elemental energies."
Like SIMD coding

hutch--

No, it works on ax eax and rax or any of the other 64 bit integer registers. The combination I have in mind is rdrand AND a seeded algo combined. :hmmm:
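
One possible shape for the "rdrand AND a seeded algo combined" idea, as a sketch only. The software half here is a plain xorshift64 step (shift constants 13, 7, 17), which is a stand-in chosen for the example and not the algorithm planned in this thread; the names mixrand and state and the seed constant are hypothetical. The rdrand result is XORed with the software stream, so the output does not depend on either source alone.

```asm
    include \masm32\include64\masm64rt.inc

    .data
      state dq 9E3779B97F4A7C15h        ; hypothetical seed, must not be zero

    .code

entry_point proc

    USING r12,r13

    SaveRegs

    mov r12, 8                          ; show 8 combined values

  @@:
    rcall mixrand
    mov r13, rax
    conout "    ",hex$(r13),lf
    sub r12, 1
    jnz @B

    waitkey
    RestoreRegs
    .exit

entry_point endp

NOSTACKFRAME

mixrand proc

    mov rax, state                      ; xorshift64 step on the seeded state
    mov rdx, rax
    shl rdx, 13
    xor rax, rdx
    mov rdx, rax
    shr rdx, 7
    xor rax, rdx
    mov rdx, rax
    shl rdx, 17
    xor rax, rdx
    mov state, rax                      ; save the new state

  retry:
    rdrand rdx                          ; CF = 1 when a value was delivered
    jnc retry                           ; unbounded retry, kept simple for the sketch
    xor rax, rdx                        ; combine hardware and software streams
    ret

mixrand endp

STACKFRAME

    end
```

As written, mixrand keeps its state in a single global, so it is meant for one thread; a multi thread version would need a separate state per thread.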

daydreamer

Quote from: hutch-- on June 14, 2019, 03:12:20 AM
No, it works on ax eax and rax or any of the other 64 bit integer registers. The combination I have in mind is rdrand AND a seeded algo combined. :hmmm:
What I have in mind is to test rdrand against a packed SIMD random generator.
I haven't tested this yet, and have not made a dword version or an AVX version yet.
movaps xmm7,seedq
paddw xmm7,r12345678q
pmullw xmm7,seedq
movaps seedq,xmm7

hutch--

Let us know when you get it going, producing the seed for AVX will be a ton of fun. You will have to code the random algo yourself as rdrand only works up to QWORD, whereas you need 256 bit data, or 128 bit data if you settle for legacy SSE.

aw27

To generate a good seed we can use RDSEED; it has multiplicative prediction resistance. We multiply 4 of them together to get a good 256 bit seed. Probably, it is better to generate 4 or 8 qword seeds in a row and feed the AVX with them.
Then we can have a lot of fun mass producing pseudo random numbers with AVX, either 1 at a time, 2, 4, 8, etc at a time. For some variations we need to combine vectors.
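
A minimal sketch of collecting 4 qword seeds in a row with RDSEED, under the assumption that the assembler in use accepts the rdseed mnemonic. The names has_rdseed, fill_seeds and seeds are made up for the example. Support is checked first with CPUID leaf 7 (EBX bit 18), and each rdseed is retried until the carry flag reports that a value was delivered. The AVX generator itself is left out; the 32 byte buffer is only the seed material it would be loaded from.

```asm
    include \masm32\include64\masm64rt.inc

    .data
      seeds dq 4 dup (0)                ; 256 bits of seed material

    .code

entry_point proc

    USING r12,r13,r14

    SaveRegs

    rcall has_rdseed
    test rax, rax
    jz nosup

    rcall fill_seeds

    lea r12, seeds
    mov r13, 4
  @@:
    mov r14, [r12]
    conout "    ",hex$(r14),lf
    add r12, 8
    sub r13, 1
    jnz @B
    jmp finish

  nosup:
    conout "RDSEED is not available on this CPU",lf

  finish:
    waitkey
    RestoreRegs
    .exit

entry_point endp

NOSTACKFRAME

has_rdseed proc                         ; returns rax = 1 if RDSEED is supported, else 0

    push rbx                            ; cpuid overwrites rbx
    xor eax, eax
    cpuid                               ; eax = highest basic leaf
    cmp eax, 7
    jb absent
    mov eax, 7
    xor ecx, ecx
    cpuid                               ; structured extended feature flags
    xor eax, eax
    bt ebx, 18                          ; EBX bit 18 = RDSEED
    setc al
    pop rbx
    ret
  absent:
    xor eax, eax
    pop rbx
    ret

has_rdseed endp

fill_seeds proc                         ; fills the 4 qwords at seeds

    lea r10, seeds
    mov r11, 4
  next:
    rdseed rax                          ; CF = 1 when a seed was delivered
    jc store
    pause                               ; entropy not ready yet, back off and retry
    jmp next
  store:
    mov [r10], rax
    add r10, 8
    sub r11, 1
    jnz next
    ret

fill_seeds endp

STACKFRAME

    end
```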

hutch--

The only problem with rdseed is that you need fairly recent hardware to use it. I would be inclined to do multiple rdrand 64 bit results and combine them to get the unpredictability required. At the moment I am working on a 64 bit seeding algo that does not use rdrand or rdseed so it can be used on older hardware.

This is the play version at the moment. It is about as scientific at the moment as twiddling the numbers.

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    include \masm32\include64\masm64rt.inc

    .code

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

entry_point proc

    USING r12,r13,r14

    SaveRegs

    mov r12, 32

    @@:
    rcall seed64
    mov r13, rax
    mov r14, rax

    conout "    ",hex$(r13),"     - "
    conout "    ",str$(r14),lf

    sub r12, 1
    jnz @B

    waitkey
    RestoreRegs
    .exit

entry_point endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

NOSTACKFRAME

seed64 proc

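    rdtsc               ; time stamp counter, low 32 bits in eax
    pause
    bswap rax           ; byte reverse, moves the low 32 bits to the upper half
    mov r11, rax        ; r11 is the accumulator

    mov rcx, 10         ; loop count
  @@:
    rdtsc               ; read the time stamp counter again
    pause               ; spinlock pause
    bswap rax           ; reverse byte order
    rol rax, 7          ; rotate left by prime
    xor r11, rax        ; xor rax to r11
    rol r11, 5          ; rotate the accumulator
    sub rcx, 1          ; decrement counter

    jnz @B
    mov rax, r11        ; return the seed in rax
    ret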

seed64 endp

STACKFRAME

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    end

aw27

Linux uses RDRAND together with other sources of entropy and I think RDSEED is being used as well, when available, in more recent kernels.