Multi thread rdrand test piece

hutch-- · June 12, 2019, 05:17:56 AM

This test piece was designed to run 4 threads with the same code to create 4 buffers filled with rdrand random numbers that are saved to disk sequentially to make up the total size. Compared with an earlier test piece using the same loop code, the 4 thread version clocks about 4 times faster, which is surprising in that you usually don't get 4 times the speed with 4 threads. Results are in the good quality range and the test piece automatically runs ENT to see if the results are any good.

With "rdrand" there is some lingering suspicion that security agencies may have forced Intel to put a back door in the hardware, so using this technique for security would need another random algo to be combined with the rdrand version. The current style of output is good quality random though, and if you have need of random pads other than for security, this will work well.

Note that the file IO is very ordinary but the random generation is reasonably fast.
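The idea of combining rdrand output with a second algorithm can be sketched as follows. This is only an illustration in Python, not the MASM used in the test piece, and every name in it is made up for the example: `os.urandom` stands in for a buffer filled by rdrand, and SplitMix64 is just one possible seeded generator, not one used in this thread. The two streams are XORed byte by byte, so the pad does not depend on the hardware source alone.

```python
import os

MASK64 = 0xFFFFFFFFFFFFFFFF

def splitmix64_stream(seed, count):
    """Yield `count` 64-bit values from a SplitMix64 generator (assumed choice)."""
    state = seed & MASK64
    for _ in range(count):
        state = (state + 0x9E3779B97F4A7C15) & MASK64
        z = state
        z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
        z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & MASK64
        yield z ^ (z >> 31)

def combine(hw_bytes, seed):
    """XOR a hardware-sourced byte string with the seeded generator's output."""
    qwords = (len(hw_bytes) + 7) // 8
    soft = b"".join(v.to_bytes(8, "little")
                    for v in splitmix64_stream(seed, qwords))
    # zip stops at the shorter input, so the result has len(hw_bytes) bytes.
    return bytes(a ^ b for a, b in zip(hw_bytes, soft))

if __name__ == "__main__":
    hw = os.urandom(64)            # stand-in for a buffer filled by rdrand
    pad = combine(hw, seed=12345)  # the seed value here is arbitrary
    print(pad.hex())
```

For real security use the seed would also have to be unpredictable, which this sketch does not address.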

LiaoMi · June 12, 2019, 08:12:18 PM

Hi Hutch,

my results...

Creating pad
Thread 1 here
Thread 2 here
Thread 3 here
Thread 4 here
Writing pad to disk
File written to disk at 1024 megabytes
Analysing output
Wait for the result
Entropy = 8.000000 bits per byte.

Optimum compression would reduce the size
of this 1073741824 byte file by 0 percent.

Chi square distribution for 1073741824 samples is 297.57, and randomly
would exceed this value 5.00 percent of the times.

Arithmetic mean value of data bytes is 127.5028 (127.5 = random).
Monte Carlo value for Pi is 3.141555604 (error 0.00 percent).
Serial correlation coefficient is -0.000020 (totally uncorrelated = 0.0).
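For readers who want to check a pad without ENT, three of the figures above can be approximated with a few lines of Python. This is a sketch under the usual textbook definitions (Shannon entropy per byte, chi-square against a uniform distribution over 256 byte values, arithmetic mean), not ENT's own source code, and the function name is invented for the example. The Monte Carlo Pi and serial correlation tests are left out.

```python
import math
import os
from collections import Counter

def ent_summary(data):
    """Return (entropy in bits per byte, chi-square, arithmetic mean)."""
    n = len(data)
    counts = Counter(data)
    # Shannon entropy: -sum(p * log2(p)) over the byte values that occur.
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    # Chi-square against a uniform distribution over the 256 byte values.
    expected = n / 256
    chi_square = sum((counts.get(v, 0) - expected) ** 2 / expected
                     for v in range(256))
    mean = sum(data) / n
    return entropy, chi_square, mean

if __name__ == "__main__":
    sample = os.urandom(1 << 20)   # stand-in for the contents of the pad file
    entropy, chi_square, mean = ent_summary(sample)
    print(f"Entropy = {entropy:.6f} bits per byte")
    print(f"Chi square = {chi_square:.2f}")
    print(f"Arithmetic mean = {mean:.4f} (127.5 = random)")
```

A perfectly flat distribution gives 8 bits per byte, a chi-square of 0 and a mean of 127.5. Real random data should land close to 8 and 127.5, with a chi-square that is neither very small nor very large.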

aw27 · June 12, 2019, 10:03:40 PM

This is a variation with 64 threads, using WaitForMultipleObjects.
Incidentally, I could not make conout print properly from the threads (I did not find the reason), so I ended up using vc_printf.

hutch-- · June 13, 2019, 12:32:43 AM

That worked well Jose, I added timing to it and it's about 30 - 35% faster than the 4 thread version. When I get a bit further along I will play with a higher thread count; the example was aimed at 4 cores as that is what most people are running at the moment.

hutch-- · June 13, 2019, 11:32:59 AM

This is the next version, primarily testing the increased thread count and how it affects timing. I have stuck with the spinlock to avoid the 64 handle limit of WaitForMultipleObjects and have tested 128 and 256 threads but it does not get faster than the 64 thread count. Testing over the thread count range, 4 threads runs at about the speed of my earlier test piece, 8 is a lot faster and you get incremental gains up to 64 threads.

Older hardware may not be happy with high thread counts but should handle 16 or 32 threads.

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

include \masm32\include64\masm64rt.inc

.data
flag dq ? ; completion flag
bsiz dq ? ; buffer size
blsz dq ? ; block size

.code

tcnt equ <64> ; <8> <16> <32> <64> <128> <256> ; thread count

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

entry_point proc

USING r12,r13,r14,r15
LOCAL hFile :QWORD
LOCAL bwrt :QWORD

SaveRegs ; save the non-volatile registers listed in USING

mov flag, 0 ; zero the completion flag
mov bsiz, 1024*1024*1024 ; allocated size
mov blsz, 1024*1024*1024/tcnt ; individual block size

conout "Creating random pad",lf

mov r15, alloc(bsiz) ; allocate single block

mov r14, rvcall(GetTickCount) ; start the timing

mov r12, tcnt ; tcnt thread count
mov r13, r15
@@:
invoke CreateThread,0,0,ptr$(Thread),r13,0,0
rcall CloseHandle, rax
add r13, blsz ; set next write location
sub r12, 1
jnz @B

spinlock: ; poll for completion count
cmp flag, tcnt
jne spinlock

rcall GetTickCount
sub rax, r14
conout "Timing = ",str$(rax)," ms",lf ; display timing results

conout "Thread Completion Count = ",str$(flag),lf
conout "Saving file to disk",lf
mov bwrt, savefile("test.pad",r15,bsiz) ; write result to disk

exec "ent test.pad"

waitkey "Wait for ENT to complete",lf
RestoreRegs ; restore the saved registers
.exit

entry_point endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

NOSTACKFRAME

Thread proc

mov rdx, blsz ; load the block size

@@:
rdrand rax
jnc @B ; CF = 0 means no random value was available, so retry
mov [rcx], rax ; rcx is the buffer address
add rcx, 8
sub rdx, 8
jnz @B

lock add flag, 1 ; atomically increment the completion flag
ret

Thread endp

STACKFRAME

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

end
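The structure of the test piece above (one buffer split into equal blocks, one worker per block, a shared completion count) can be sketched in Python for anyone who wants to experiment without an assembler. The names are invented for the example and `os.urandom` stands in for the rdrand loop. A lock protects the shared counter, which is the job a `lock` prefix does for a read-modify-write instruction on x86, and `join` replaces the polling loop. The sketch assumes the total size divides evenly by the thread count.

```python
import os
import threading

def fill_pad(total_size, thread_count):
    """Fill one buffer from `thread_count` workers; return (buffer, completions)."""
    block = total_size // thread_count       # mirrors blsz = bsiz / tcnt
    buf = bytearray(total_size)
    done = 0
    done_lock = threading.Lock()

    def worker(offset):
        nonlocal done
        # Each worker writes only its own, non-overlapping slice.
        buf[offset:offset + block] = os.urandom(block)
        with done_lock:                      # plays the role of an atomic add
            done += 1

    threads = [threading.Thread(target=worker, args=(i * block,))
               for i in range(thread_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                             # block here instead of polling
    return buf, done

if __name__ == "__main__":
    pad, completed = fill_pad(1024 * 1024, 64)
    print("Thread Completion Count =", completed)
    print("Pad size =", len(pad))
```

Without the lock (or the `lock` prefix in assembler) two threads can read the same old count and both write back the same new value, so the count may never reach the thread count and a polling loop would spin forever.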

aw27 · June 13, 2019, 01:54:19 PM

Probably, in this particular scenario, having more threads than the number of physical cores (not logical cores) will not bring any real advantage.
For most people it is 4 or 6 physical cores, up to 32 for people with AMD Threadripper.

However, this is not always the case, in particular when threads stall waiting for a resource.

aw27 · June 13, 2019, 06:21:10 PM

I want to clarify my previous statement, which apparently contradicts Hutch's findings. It does not.

When I said that a number of threads equal to the number of physical cores is probably the best we can do in this scenario, there is a question I did not answer. On a CPU with 4 physical cores/8 logical threads, when I launch a program with 4 threads, how can I guarantee that each of them goes to a different physical core? I cannot; some physical core(s) may receive more than 1 thread. Some people mentioned that in Windows logical and physical cores are interleaved. If this is true, we can play with affinity masks to pin each thread to a different physical core. This might be an interesting exercise.

In my own experiments with Hutch's last exercise (not using complications like affinity masks, etc) the best is achieved with 11 threads, because I have 6 physical cores and the main thread is extremely busy polling. If we don't do polling, and use WaitForMultipleObjects, the best is achieved with 12 threads.
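The affinity mask idea comes down to simple bit arithmetic, sketched here in Python. It rests on the unverified assumption mentioned above, that logical processors are numbered so that siblings are adjacent (0 and 1 share the first physical core, 2 and 3 the second, and so on). The function name is invented for the example. Each mask has one bit set, in the form the Win32 `SetThreadAffinityMask` function takes, where bit n selects logical processor n.

```python
def physical_core_masks(physical_cores, threads_per_core=2):
    """One affinity mask per physical core, selecting its first logical CPU.

    Assumes sibling logical CPUs are numbered consecutively, which should be
    checked on the target machine before relying on it.
    """
    return [1 << (core * threads_per_core) for core in range(physical_cores)]

if __name__ == "__main__":
    for core, mask in enumerate(physical_core_masks(4)):
        print(f"core {core}: mask 0x{mask:02X}")
```

On a 4 core/8 thread CPU this selects logical processors 0, 2, 4 and 6, so under the stated assumption no two worker threads would share a physical core.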

hutch-- · June 13, 2019, 06:37:13 PM

I don't see any problem in your comment, there are variables in play that are not well documented with processor hardware. It was not all that long ago that each thread would add about 90% to the first thread but the first test I did with 4 threads was 4 times faster than a single thread and I suggest it is because the hardware is getting smarter and more efficient. I also don't know what the fetch mechanism is with "rdrand" in relation to the preferred instruction set and while the reference material says the result is derived from thermal noise, it is a more complex instruction than ADD or MOV.

If rdrand is laggy as I suspect, then it means that the thread is not being fully utilised, which explains why you can run so many threads. It is not uncommon with socket communication to have under-utilised threads, which allows you to run many more threads without processor saturation. Sad to say you learn these things by test piece as the documentation does not really help you here.

daydreamer · June 14, 2019, 01:57:55 AM

Interesting, maybe I am going to try rdrand vs a random generator snippet.
Doesn't it require some AVX code to fully utilize a thread/core?

hutch-- · June 14, 2019, 03:12:20 AM

No, it works on ax eax and rax or any of the other 64 bit integer registers. The combination I have in mind is rdrand AND a seeded algo combined.

daydreamer · June 15, 2019, 01:42:21 AM

Quote from: hutch-- on June 14, 2019, 03:12:20 AM
No, it works on ax eax and rax or any of the other 64 bit integer registers. The combination I have in mind is rdrand AND a seeded algo combined.

What I have in mind is to test rdrand vs packed SIMD random generators.
I haven't tested this yet, and have not made a dword version or an AVX version yet.

movaps xmm7,seedq
paddw xmm7,r12345678q
pmullw xmm7,seedq
movaps seedq,xmm7
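What those four instructions compute can be modelled lane by lane in Python, which makes it easy to try out seeds before writing the SSE version. The sketch assumes `seedq` and `r12345678q` are each eight packed 16-bit words, and the function name is invented for the example. Per lane the update is new = ((seed + addend) * seed) mod 2^16, since `paddw` wraps and `pmullw` keeps the low 16 bits of each product.

```python
LANES = 8        # an XMM register holds eight 16-bit words
MASK16 = 0xFFFF

def step(seed, addend):
    """One pass of the snippet, applied independently to every 16-bit lane."""
    return [(((s + a) & MASK16) * s) & MASK16 for s, a in zip(seed, addend)]

if __name__ == "__main__":
    seed = [2 * i + 1 for i in range(LANES)]   # arbitrary example seeds
    addend = [12345] * LANES                   # arbitrary example constant
    for _ in range(4):
        seed = step(seed, addend)
        print(" ".join(f"{lane:04X}" for lane in seed))
```

One property is visible straight away: a lane that ever reaches 0 stays at 0, because the old seed is also the multiplier. A generator of this form would need to guard against that.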

hutch-- · June 15, 2019, 02:13:18 AM

Let us know when you get it going, producing the seed for AVX will be a ton of fun. You will have to code the random algo yourself as rdrand only works up to QWORD where you need 256 bit data or 128 bit data if you settle for legacy SSE.

aw27 · June 15, 2019, 01:58:58 PM

To generate a good seed we can use RDSEED; it has multiplicative prediction resistance. We multiply 4 of them together to get a good 256 bit seed. Probably, it is better to generate 4 or 8 qword seeds in a row and feed the AVX with them.
Then we can have a lot of fun mass producing pseudo random numbers with AVX, either 1 at a time, 2, 4, 8, etc at a time. For some variations we need to combine vectors.
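Gathering several qword seeds in a row into one vector-sized block can be sketched like this in Python. The names are invented for the example and `os.urandom` is only a stand-in for RDSEED, which Python cannot call directly. Four qwords packed little-endian give the 32 bytes that would be loaded into one 256 bit register.

```python
import os
import struct

def os_qword():
    """Stand-in for RDSEED: 64 bits from the operating system's entropy pool."""
    return int.from_bytes(os.urandom(8), "little")

def gather_seed(next_qword, count=4):
    """Collect `count` 64-bit values and pack them into one little-endian block."""
    return struct.pack(f"<{count}Q", *(next_qword() for _ in range(count)))

if __name__ == "__main__":
    seed256 = gather_seed(os_qword)      # 4 qwords = 256 bits
    seed512 = gather_seed(os_qword, 8)   # 8 qwords, e.g. two 256 bit vectors
    print(seed256.hex())
    print(seed512.hex())
```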

hutch-- · June 15, 2019, 07:00:45 PM

The only problem with rdseed is you need very recent hardware to use it. I would be inclined to do multiple rdrand 64 bit results and combine them to get the unpredictability required. At the moment I am working on a 64 bit seeding algo that does not use rdrand or rdseed so it can be used on older hardware.

This is the play version at the moment. It is about as scientific at the moment as twiddling the numbers.

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

include \masm32\include64\masm64rt.inc

.code

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

entry_point proc

USING r12,r13,r14

SaveRegs

mov r12, 32

@@:
rcall seed64
mov r13, rax
mov r14, rax

conout " ",hex$(r13)," - "
conout " ",str$(r14),lf

sub r12, 1
jnz @B

waitkey
RestoreRegs
.exit

entry_point endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

NOSTACKFRAME

seed64 proc

rdtsc
pause
bswap rax
mov r11, rax

mov rcx, 10 ; loop count
@@:
rdtsc ; read the time stamp counter
pause ; spinlock pause
bswap rax ; reverse byte order
rol rax, 7 ; rotate left by prime
xor r11, rax ; xor rax to r11
rol r11, 5
sub rcx, 1 ; decrement counter

jnz @B
mov rax, r11
ret

seed64 endp

STACKFRAME

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

end
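The mixing steps in seed64 can be modelled in Python to see how the bits move around. This is a sketch, not a replacement for the assembler: `time.perf_counter_ns` is only a stand-in for rdtsc and the function names are invented for the example. One detail the model makes explicit is that in 64-bit mode rdtsc returns the counter in EDX:EAX and clears the upper half of RAX, so only the low 32 bits of each reading enter the mix unless EDX is merged in.

```python
import time

MASK64 = 0xFFFFFFFFFFFFFFFF

def bswap64(x):
    """Reverse the byte order of a 64-bit value, like bswap rax."""
    return int.from_bytes(x.to_bytes(8, "little"), "big")

def rol64(x, n):
    """Rotate a 64-bit value left by n bits (0 < n < 64)."""
    return ((x << n) | (x >> (64 - n))) & MASK64

def seed64_model(samples):
    """Mix 11 counter readings the way the seed64 procedure does."""
    samples = iter(samples)
    # rdtsc leaves only the low 32 bits of the counter in rax.
    r11 = bswap64(next(samples) & 0xFFFFFFFF)
    for _ in range(10):
        rax = rol64(bswap64(next(samples) & 0xFFFFFFFF), 7)
        r11 = rol64(r11 ^ rax, 5)
    return r11

if __name__ == "__main__":
    readings = (time.perf_counter_ns() for _ in range(11))
    print(f"{seed64_model(readings):016X}")
```

Feeding the model fixed readings shows that the result is fully determined by the counter values, so the quality of the seed depends on how unpredictable the timing between readings is.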

aw27 · June 16, 2019, 12:49:33 AM

Linux uses RDRAND together with other sources of entropy and I think RDSEED is being used as well, when available, in more recent kernels.

The MASM Forum
