The MASM Forum

Title: Multi thread rdrand test piece
Post by: hutch-- on June 12, 2019, 05:17:56 AM
This test piece runs 4 threads with the same code to fill 4 buffers with rdrand random numbers, which are then saved to disk sequentially to make up the total size. Compared with an earlier test piece using the same loop code, the 4 thread version clocks about 4 times faster, which is surprising in that you usually don't get 4 times the speed with 4 threads. Results are in the good quality range and the test piece automatically runs ENT to see if the results are any good.

With "rdrand" there is some lingering suspicion that security agencies may have forced Intel to put a back door in the hardware, so for security use this technique would need another random algo combined with the rdrand output. The current style of output is good quality random though, and if you need random pads for anything other than security, this will work well.

Note that the file IO is very ordinary but the random generation is reasonably fast.
Title: Re: Multi thread rdrand test piece
Post by: LiaoMi on June 12, 2019, 08:12:18 PM
Hi Hutch,

my results...

Creating pad
Thread 1 here
Thread 2 here
Thread 3 here
Thread 4 here
Writing pad to disk
File written to disk at 1024 megabytes
Analysing output
Wait for the result
Entropy = 8.000000 bits per byte.

Optimum compression would reduce the size
of this 1073741824 byte file by 0 percent.

Chi square distribution for 1073741824 samples is 297.57, and randomly
would exceed this value 5.00 percent of the times.

Arithmetic mean value of data bytes is 127.5028 (127.5 = random).
Monte Carlo value for Pi is 3.141555604 (error 0.00 percent).
Serial correlation coefficient is -0.000020 (totally uncorrelated = 0.0).
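
For readers who want to see what two of the ENT figures above measure, here is a minimal C sketch of the arithmetic mean and the chi-square statistic over byte values. It is not ENT's source code; the function names `byte_mean` and `byte_chi_square` are made up for this illustration, and the percentage that ENT prints next to the chi-square value (the tail probability) is not computed here.

```c
#include <stddef.h>
#include <stdint.h>

/* Arithmetic mean of the byte values; 127.5 is the ideal for uniform bytes. */
double byte_mean(const uint8_t *buf, size_t n)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += buf[i];
    return n ? (double)sum / (double)n : 0.0;
}

/* Chi-square statistic of the 256-bin byte histogram against a uniform
   distribution: the sum over all byte values of (observed - expected)^2 / expected. */
double byte_chi_square(const uint8_t *buf, size_t n)
{
    uint64_t count[256] = { 0 };
    double expected = (double)n / 256.0;
    double chi = 0.0;

    if (n == 0)
        return 0.0;
    for (size_t i = 0; i < n; i++)
        count[buf[i]]++;
    for (int v = 0; v < 256; v++) {
        double d = (double)count[v] - expected;
        chi += d * d / expected;
    }
    return chi;
}
```

A buffer in which every byte value occurs equally often gives a chi-square of 0, which is as suspicious as a very large value; for random data the statistic should land near 255, the number of degrees of freedom.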
Title: Re: Multi thread rdrand test piece
Post by: aw27 on June 12, 2019, 10:03:40 PM
This is a variation with 64 threads, using WaitForMultipleObjects.
Incidentally, I could not make conout print properly from the threads (I did not find the reason), so I ended up using vc_printf.
Title: Re: Multi thread rdrand test piece
Post by: hutch-- on June 13, 2019, 12:32:43 AM
That worked well Jose. I added timing to it and it is about 30 - 35% faster than the 4 thread version. When I get a bit further along I will play with a higher thread count; the example was aimed at 4 cores as that is what most people are running at the moment.
Title: Re: Multi thread rdrand test piece
Post by: hutch-- on June 13, 2019, 11:32:59 AM
This is the next version, primarily testing the increased thread count and how it affects timing. I have stuck with the spinlock to avoid the 64 handle limit of WaitForMultipleObjects and have tested 128 and 256 threads, but it does not get faster than the 64 thread count. Testing over the thread count range, 4 threads runs at about the speed of my earlier test piece, 8 is a lot faster and you get incremental gains up to 64 threads.

Older hardware may not be happy with high thread counts but should handle 16 or 32 threads.


; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    include \masm32\include64\masm64rt.inc

    .data
      flag dq ?                                         ; completion flag
      bsiz dq ?                                         ; buffer size
      blsz dq ?                                         ; block size

    .code

    tcnt equ <64>   ; <8> <16> <32> <64> <128> <256>    ; thread count

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

entry_point proc

    USING r12,r13,r14,r15
    LOCAL hFile :QWORD
    LOCAL bwrt  :QWORD

    SaveRegs                                            ; save non-volatile registers

    mov flag, 0                                         ; zero the completion flag
    mov bsiz, 1024*1024*1024                            ; allocated size
    mov blsz, 1024*1024*1024/tcnt                       ; individual block size

    conout "Creating random pad",lf

    mov r15, alloc(bsiz)                                ; allocate single block

    mov r14, rvcall(GetTickCount)                       ; start the timing

    mov r12, tcnt                                       ; tcnt thread count
    mov r13, r15
    @@:
    invoke CreateThread,0,0,ptr$(Thread),r13,0,0
    rcall CloseHandle, rax
    add r13, blsz                                       ; set next write location
    sub r12, 1
    jnz @B

    spinlock:                                           ; poll for completion count
    cmp flag, tcnt
    jne spinlock

    rcall GetTickCount
    sub rax, r14
    conout "Timing = ",str$(rax)," ms",lf               ; display timing results

    conout "Thread Completion Count = ",str$(flag),lf
    conout "Saving file to disk",lf
    mov bwrt, savefile("test.pad",r15,bsiz)             ; write result to disk

    exec "ent test.pad"

    waitkey "Wait for ENT to complete",lf
    RestoreRegs                                         ; restore non-volatile registers
    .exit

entry_point endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

NOSTACKFRAME

Thread proc

    mov rdx, blsz                                       ; load the block size

    @@:
    rdrand rax
    jnc @B                                              ; CF clear means no value was ready, so retry
    mov [rcx], rax                                      ; rcx is the buffer address
    add rcx, 8
    sub rdx, 8
    jnz @B

    lock add flag, 1                                    ; atomically increment the completion flag
    ret

Thread endp

STACKFRAME

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    end
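
The pattern in the listing above (split one buffer into equal blocks, let each thread fill its own block, count completions in a shared variable) can be sketched in C as follows. This is a rough sketch under stated assumptions, not a port of the test piece: it uses POSIX threads rather than CreateThread, waits with pthread_join instead of polling, and a 64-bit LCG step stands in for rdrand. The names `fill_pad`, `fill_block`, `block_t` and `THREADS` are hypothetical. The point it illustrates is that the completion counter has to be incremented atomically when several threads can finish at the same moment.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define THREADS 8                       /* hypothetical thread count */

typedef struct {
    uint64_t *start;                    /* first word of this thread's block */
    size_t    words;                    /* number of 64-bit words to fill */
    uint64_t  state;                    /* per-thread generator state */
} block_t;

static atomic_int done_count;           /* completion counter shared by all workers */

static void *fill_block(void *arg)
{
    block_t *b = arg;
    for (size_t i = 0; i < b->words; i++) {
        /* stand-in generator (one 64-bit LCG step), NOT rdrand */
        b->state = b->state * 6364136223846793005ULL + 1442695040888963407ULL;
        b->start[i] = b->state;
    }
    atomic_fetch_add(&done_count, 1);   /* atomic, so no increment is lost */
    return NULL;
}

/* Fills `words` 64-bit words (assumed to be a multiple of THREADS).
   Returns the number of workers that reported completion, or -1 on error. */
int fill_pad(uint64_t *pad, size_t words)
{
    pthread_t tid[THREADS];
    block_t   blk[THREADS];
    size_t    per = words / THREADS;
    int       started = 0;

    atomic_store(&done_count, 0);
    for (; started < THREADS; started++) {
        blk[started].start = pad + (size_t)started * per;
        blk[started].words = per;
        blk[started].state = (uint64_t)started + 1;
        if (pthread_create(&tid[started], NULL, fill_block, &blk[started]) != 0)
            break;
    }
    for (int i = 0; i < started; i++)   /* wait for every worker that was started */
        pthread_join(tid[i], NULL);

    return started == THREADS ? atomic_load(&done_count) : -1;
}
```

Calling `fill_pad(pad, 1024)` on a 1024-word buffer should return the thread count once every block has been written.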
Title: Re: Multi thread rdrand test piece
Post by: aw27 on June 13, 2019, 01:54:19 PM
Probably, in this particular scenario, having more threads than the number of physical cores (not logical cores) will not bring any real advantage.
For most people that is 4 or 6 physical cores, and up to 32 for people with an AMD Threadripper.

However, this is not always the case, in particular when threads stall waiting for a resource.

Title: Re: Multi thread rdrand test piece
Post by: aw27 on June 13, 2019, 06:21:10 PM
I want to clarify my previous statement, which apparently contradicts Hutch's findings. It does not.

When I said that a number of threads equal to the number of physical cores is probably the best we can do in this scenario, there is a question I did not answer. On a CPU with 4 physical cores/8 logical threads, when I launch a program with 4 threads, how can I guarantee that each of them goes to a different physical core? I can not; some physical core(s) may receive more than 1 thread. Some people have mentioned that in Windows, logical and physical cores are interleaved. If this is true, we can play with affinity masks to pin each thread to a different physical core. This might be an interesting exercise.  :icon_idea:

In my own experiments with Hutch's last exercise (not using complications like affinity masks, etc)  the best is achieved with 11 threads, because I have 6 physical cores and the main thread is extremely busy polling. If we don't do polling, and use WaitForMultipleObjects, the best is achieved with 12 threads.
Title: Re: Multi thread rdrand test piece
Post by: hutch-- on June 13, 2019, 06:37:13 PM
I don't see any problem in your comment; there are variables in play that are not well documented with processor hardware. It was not all that long ago that each thread would add about 90% to the first thread, but the first test I did with 4 threads was 4 times faster than a single thread, and I suggest it is because the hardware is getting smarter and more efficient. I also don't know what the fetch mechanism is with "rdrand" in relation to the preferred instruction set, and while the reference material says the result is derived from thermal noise, it is a more complex instruction than ADD or MOV.

If rdrand is laggy as I suspect, then it means that the thread is not being fully utilised, which explains why you can run so many threads. It is not uncommon with socket communication to have under-utilised threads, which allows you to run many more threads without processor saturation. Sad to say you learn these things by test piece as the documentation does not really help you here.
Title: Re: Multi thread rdrand test piece
Post by: daydreamer on June 14, 2019, 01:57:55 AM
Interesting, maybe I am going to try rdrand vs a random generator snippet.
Doesn't it require some AVX code to fully utilize a thread/core?



Title: Re: Multi thread rdrand test piece
Post by: hutch-- on June 14, 2019, 03:12:20 AM
No, it works on ax, eax and rax, or any of the other 64 bit integer registers. The combination I have in mind is rdrand AND a seeded algo combined. :hmmm:
Title: Re: Multi thread rdrand test piece
Post by: daydreamer on June 15, 2019, 01:42:21 AM
Quote from: hutch-- on June 14, 2019, 03:12:20 AM
No, it works on ax, eax and rax, or any of the other 64 bit integer registers. The combination I have in mind is rdrand AND a seeded algo combined. :hmmm:
What I have in mind is to test rdrand vs a packed SIMD random generator.
I haven't tested this yet, and have not made a dword version or an AVX version yet.
movaps xmm7,seedq
paddw xmm7,r12345678q
pmullw xmm7,seedq
movaps seedq,xmm7
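
Read literally, the untested snippet above updates eight 16-bit lanes at once: each lane becomes (seed + constant) * seed, keeping only the low 16 bits. A scalar C sketch of one such step is below. It assumes `seedq` and `r12345678q` are 128-bit memory operands holding eight words each; the names `packed_step` and `LANES` are made up for illustration, and nothing here says anything about the statistical quality of the output.

```c
#include <stdint.h>

#define LANES 8   /* eight 16-bit words in one 128-bit XMM register */

/* One step of the update as written in the snippet, per lane:
   seed = (seed + add) * seed, with all arithmetic modulo 2^16. */
void packed_step(uint16_t seed[LANES], const uint16_t add[LANES])
{
    for (int i = 0; i < LANES; i++) {
        uint16_t t = (uint16_t)(seed[i] + add[i]);          /* paddw: wraps modulo 2^16 */
        seed[i] = (uint16_t)((uint32_t)t * seed[i]);        /* pmullw: keeps the low 16 bits */
    }
}
```

One thing the scalar form makes visible is that a lane whose seed is 0 stays at 0 forever, since the product is multiplied by the old seed, so the seeding step would have to avoid zero lanes.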
Title: Re: Multi thread rdrand test piece
Post by: hutch-- on June 15, 2019, 02:13:18 AM
 :biggrin:

Let us know when you get it going, producing the seed for AVX will be a ton of fun. You will have to code the random algo yourself, as rdrand only works up to QWORD whereas you need 256 bit data for AVX, or 128 bit data if you settle for legacy SSE.
Title: Re: Multi thread rdrand test piece
Post by: aw27 on June 15, 2019, 01:58:58 PM
To generate a good seed we can use RDSEED, it has multiplicative prediction resistance. We multiply 4 of them together to get a good 256 bit seed. Probably, it is better to generate 4 or 8 qword seeds in a row and feed the AVX with them.
Then we can have a lot of fun mass producing pseudo random numbers with AVX, either 1 at a time, 2, 4, 8, etc at a time. For some variations we need to combine vectors.
Title: Re: Multi thread rdrand test piece
Post by: hutch-- on June 15, 2019, 07:00:45 PM
The only problem with rdseed is that you need very recent hardware to use it. I would be inclined to do multiple rdrand 64 bit results and combine them to get the unpredictability required. At the moment I am working on a 64 bit seeding algo that does not use rdrand or rdseed so it can be used on older hardware.

This is the play version at the moment. It is about as scientific at the moment as twiddling the numbers.

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    include \masm32\include64\masm64rt.inc

    .code

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

entry_point proc

    USING r12,r13,r14

    SaveRegs

    mov r12, 32

    @@:
    rcall seed64
    mov r13, rax
    mov r14, rax

    conout "    ",hex$(r13),"     - "
    conout "    ",str$(r14),lf

    sub r12, 1
    jnz @B

    waitkey
    RestoreRegs
    .exit

entry_point endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

NOSTACKFRAME

seed64 proc

    rdtsc
    pause
    bswap rax
    mov r11, rax

    mov rcx, 10         ; loop count
  @@:
    rdtsc               ; read time stamp counter (low 32 bits in eax)
    pause               ; spinlock pause
    bswap rax           ; reverse byte order
    rol rax, 7          ; rotate left by prime
    xor r11, rax        ; xor rax to r11
    rol r11, 5
    sub rcx, 1          ; decrement counter

    jnz @B
    mov rax, r11
    ret

seed64 endp

STACKFRAME

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    end
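
For anyone who wants to experiment with the mixing step of seed64 outside of MASM, here is a C sketch of the same byte swap / rotate / xor sequence. It assumes the timestamp samples are supplied by the caller as 32-bit values, because rdtsc leaves only the low 32 bits of the counter in rax (the high half goes to edx, which the listing does not use). The names `seed64_mix`, `rotl64` and `bswap64` are made up for this sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Rotate left; r must be in the range 1..63. */
uint64_t rotl64(uint64_t x, unsigned r)
{
    return (x << r) | (x >> (64 - r));
}

/* Reverse the byte order of a 64-bit value (what bswap rax does). */
uint64_t bswap64(uint64_t x)
{
    uint64_t y = 0;
    for (int i = 0; i < 8; i++) {
        y = (y << 8) | (x & 0xFF);
        x >>= 8;
    }
    return y;
}

/* Mix n timestamp samples (n >= 1) the way the seed64 loop does:
   the first sample initialises the accumulator, every later sample is
   byte swapped, rotated left by 7, xor'ed in, and the accumulator is
   then rotated left by 5. */
uint64_t seed64_mix(const uint32_t *tsc_low, size_t n)
{
    uint64_t acc = bswap64((uint64_t)tsc_low[0]);
    for (size_t i = 1; i < n; i++) {
        acc ^= rotl64(bswap64((uint64_t)tsc_low[i]), 7);
        acc = rotl64(acc, 5);
    }
    return acc;
}
```

Passing 11 samples corresponds to the initial rdtsc plus the 10 loop iterations in the listing. Feeding in recorded samples makes it easy to see how much of the output depends on the few low bits of the counter that actually change between reads.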
Title: Re: Multi thread rdrand test piece
Post by: aw27 on June 16, 2019, 12:49:33 AM
Linux uses RDRAND together with other sources of entropy and I think RDSEED is being used as well, when available, in more recent kernels.
Title: Re: Multi thread rdrand test piece
Post by: hutch-- on June 16, 2019, 02:07:58 AM
That is pretty much what I had in mind: rdrand for hardware that supports it, an alternative that is probably at least as good for older hardware, and for hardware that supports rdrand, a combination of the two. The combination is slower but I can't see a method of reproducing a random pad made this way, which is the target for a random pad.
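
One simple way to combine the two sources is to XOR a seeded generator's stream over a pad that already holds hardware output. The C sketch below shows that idea under stated assumptions: SplitMix64 is used as a stand-in seeded generator (it is not the library's irand, whose internals are not shown in this thread), the pad is assumed to have been filled by rdrand beforehand, and the names `splitmix64_next` and `pad_mix` are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* SplitMix64 step: a stand-in for "a seeded algo", NOT the library's irand. */
uint64_t splitmix64_next(uint64_t *state)
{
    uint64_t z = (*state += 0x9E3779B97F4A7C15ULL);
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    return z ^ (z >> 31);
}

/* XOR the seeded stream over a pad that is assumed to already contain
   hardware random words. */
void pad_mix(uint64_t *pad, size_t words, uint64_t seed)
{
    uint64_t state = seed;
    for (size_t i = 0; i < words; i++)
        pad[i] ^= splitmix64_next(&state);
}
```

Because XOR is its own inverse, applying `pad_mix` a second time with the same seed restores the original words, so the combined pad is only as hard to reproduce as the hardware words and the seed together.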
Title: Re: Multi thread rdrand test piece
Post by: hutch-- on June 16, 2019, 10:37:44 PM
This is the first test piece for a rdrand free pad generator. The surprising thing is that it is far faster than rdrand. It uses the irand algo in the library, and the ENT analysis is at least as good as the rdrand version. As this test piece is a simple single pass of one seed value, it is not secure in terms of brute forcing, but with a QWORD seed range it will at least make them work for it.  :tongue:

To make it secure it needs to be multi seeded and multi passed, but it's fast enough to do that.

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    include \masm32\include64\masm64rt.inc

    .data?
      seed dq ?
    .code

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

entry_point proc

    USING r12,r13
    LOCAL bcnt  :QWORD
    LOCAL block :QWORD
    LOCAL pMem  :QWORD
    LOCAL bwrt  :QWORD
    LOCAL rqst  :QWORD
    LOCAL tcnt  :QWORD

    SaveRegs

    mov rax, 1024*1024*1024             ; requested pad size
    mov rqst, rax

    mov rax, rqst
    add rax, 1024*64                    ; add 64k
    mov bcnt, rax                       ; store it in bcnt
    mov block, rax                      ; store it in block

    rcall intdiv,block, 8               ; divide block by 8
    mov block, rax

    mov pMem, alloc(bcnt)               ; allocate the byte count

    call reseed                         ; get a seed for the random algo
    rcall seed_irand,rax

    conout "Creating Pad",lf

    mov tcnt, rv(GetTickCount)

    mov r12, block                      ; block is loop counter
    mov r13, pMem                       ; memory address
  @@:
    rcall irand                         ; call random algo
    mov QWORD PTR [r13], rax
    add r13, 8
    sub r12, 1
    jnz @B

    rcall GetTickCount
    sub rax, tcnt

    conout "Timing = ",str$(rax)," ms",lf
    conout "Saving File",lf

    mov bwrt, savefile("random.pad",pMem,rqst)
    mfree pMem

    conout "Bytes written to disk = ",str$(bwrt),lf,lf
    conout "Wait for ENT to complete analysis",lf

    exec "ent random.pad"

    waitkey
    RestoreRegs
    .exit

entry_point endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

NOSTACKFRAME

reseed proc

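    mov r10, rbx        ; cpuid overwrites rbx, which is non-volatile in Win64
    mov rax, 100        ; out of range leaf number
    cpuid               ; serialising instruction, clobbers eax ebx ecx edx
    mov rbx, r10        ; restore rbx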
    rdtsc
    pause
    bswap rax
    mov r11, rax

    mov rcx, 10         ; loop count
  @@:
    rdtsc               ; read time stamp counter (low 32 bits in eax)
    pause               ; spinlock pause
    bswap rax           ; reverse byte order
    rol rax, 7          ; rotate left by prime
    xor r11, rax        ; xor rax to r11
    rol r11, 5          ; rotate left by prime
    sub rcx, 1          ; decrement counter

    jnz @B
    mov rax, r11
    ret

reseed endp

STACKFRAME

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    end
Title: Re: Multi thread rdrand test piece
Post by: aw27 on June 17, 2019, 09:05:12 PM
We can also use (presumably) true random values from random.org to seed our pseudo random number generator.
Explanation page:
https://www.random.org/clients/http/

This is the C code to access the site and retrieve values. Although it is very easy to convert to Masm I will not do it at this time:


#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <conio.h>
#include <Windows.h>
#include <wininet.h>

#pragma comment(lib, "wininet.lib")

#define COUNTVALUES 10
#define MINVAL 0
#define MAXVAL 65535

int main(void)
{
    HINTERNET hInternet, hConnect, hHttpRequest;
    char  szData[1024] = { 0 };
    int   intArray[COUNTVALUES] = { 0 };
    char* next;
    char* curr;
    char  queryStr[1024] = { 0 };
    DWORD total = 0;                    // bytes received so far
    int   i;

    snprintf(queryStr, sizeof(queryStr), "/integers/?num=%d&min=%d&max=%d&col=1&base=10&format=plain&rnd=new", COUNTVALUES, MINVAL, MAXVAL);

    // The ANSI (A) variants are called explicitly so the narrow strings also compile in Unicode builds
    hInternet = InternetOpenA("Random.org test", INTERNET_OPEN_TYPE_PRECONFIG, NULL, NULL, 0);
    if (hInternet != NULL)
    {
        hConnect = InternetConnectA(hInternet, "www.random.org", INTERNET_DEFAULT_HTTPS_PORT, NULL, NULL, INTERNET_SERVICE_HTTP, 0, 0);
        if (hConnect != NULL)
        {
            hHttpRequest = HttpOpenRequestA(hConnect, "GET", queryStr, NULL, NULL, NULL, INTERNET_FLAG_SECURE, 0);
            if (hHttpRequest != NULL)
            {
                if (HttpSendRequestA(hHttpRequest, NULL, 0, NULL, 0))
                {
                    // Append each chunk so a reply split across several reads is not overwritten
                    while (total < sizeof(szData) - 1)
                    {
                        DWORD dwByteRead = 0;
                        BOOL isRead = InternetReadFile(hHttpRequest, szData + total, (DWORD)(sizeof(szData) - 1 - total), &dwByteRead);

                        if (isRead == FALSE || dwByteRead == 0)
                            break;

                        total += dwByteRead;
                    }
                    szData[total] = 0;  // terminate the received text
                }
                InternetCloseHandle(hHttpRequest);
            }
            InternetCloseHandle(hConnect);
        }
        InternetCloseHandle(hInternet);
    }

    //printf("%s\n", szData);

    printf("*** %d TRUE RANDOM integer values from RANDOM.ORG in the range %d - %d ***\n\n", COUNTVALUES, MINVAL, MAXVAL);

    // The reply is plain text with one decimal value per line
    curr = szData;

    for (i = 0; i < COUNTVALUES; i++)
    {
        intArray[i] = atoi(curr);
        printf("%d\n", intArray[i]);
        next = strchr(curr, '\n');
        if (next == NULL)
            break;
        curr = next + 1;
    }
    _getch();
    return 0;
}




I attach an exe built with the above code.

*** 10 TRUE RANDOM integer values from RANDOM.ORG in the range 0 - 65535 ***

38712
30412
42693
39657
60054
63719
61011
3045
27759
50838