The MASM Forum
Microsoft 64 bit MASM => Examples => Topic started by: hutch-- on June 12, 2019, 05:17:56 AM
-
This test piece runs 4 threads with the same code to create 4 buffers filled with rdrand random numbers, which are saved to disk sequentially to make up the total size. Compared with an earlier test piece using the same loop code, the 4 thread version clocks about 4 times faster, which is surprising in that you usually don't get 4 times the speed with 4 threads. Results are in the good quality range, and the test piece automatically runs ENT to check whether the output is any good.
With "rdrand" there is some lingering suspicion that security agencies may have forced Intel to put a back door in the hardware, so for security work this technique would need another random algo combined with the rdrand output. The current output is still good quality random, and if you need random pads for purposes other than security, this will work well.
Note that the file IO is very ordinary but the random generation is reasonably fast.
-
Hi Hutch,
my results...
Creating pad
Thread 1 here
Thread 2 here
Thread 3 here
Thread 4 here
Writing pad to disk
File written to disk at 1024 megabytes
Analysing output
Wait for the result
Entropy = 8.000000 bits per byte.
Optimum compression would reduce the size
of this 1073741824 byte file by 0 percent.
Chi square distribution for 1073741824 samples is 297.57, and randomly
would exceed this value 5.00 percent of the times.
Arithmetic mean value of data bytes is 127.5028 (127.5 = random).
Monte Carlo value for Pi is 3.141555604 (error 0.00 percent).
Serial correlation coefficient is -0.000020 (totally uncorrelated = 0.0).
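For readers who want to see what the chi square line measures, here is a minimal C sketch of the statistic over the 256 possible byte values. It is an illustration under the assumption that the test compares a byte histogram against a flat distribution; `byte_chi_square` is a hypothetical name and this is not ENT's own source. With 256 bins there are 255 degrees of freedom, so uniformly random data lands around 255 on average.

```c
#include <assert.h>
#include <stddef.h>

/* Chi-square statistic of a byte buffer against a uniform distribution
   over the 256 possible byte values (hypothetical helper, not ENT code). */
double byte_chi_square(const unsigned char *buf, size_t len)
{
    size_t counts[256] = { 0 };
    double expected, chi = 0.0;
    size_t i;

    assert(buf != NULL && len > 0);
    for (i = 0; i < len; i++)
        counts[buf[i]]++;                 /* histogram of byte values */

    expected = (double)len / 256.0;       /* flat expectation per bin */
    for (i = 0; i < 256; i++) {
        double d = (double)counts[i] - expected;
        chi += d * d / expected;          /* sum of (observed - expected)^2 / expected */
    }
    return chi;
}
```

A perfectly flat histogram gives 0, which is as suspicious for a random source as a very large value.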
-
This is a variation with 64 threads, using WaitForMultipleObjects.
Incidentally, I could not make conout print properly from the threads (I did not find the reason), so I ended up using vc_printf.
-
That worked well Jose. I added timing to it and it is about 30 - 35% faster than the 4 thread version. When I get a bit further along I will play with a higher thread count; the example was aimed at 4 cores as that is what most people are running at the moment.
-
This is the next version, primarily testing the increased thread count and how it affects timing. I have stuck with the spinlock to avoid the 64 handle limit of WaitForMultipleObjects and have tested 128 and 256 threads, but it does not get faster than the 64 thread count. Testing over the thread count range, 4 threads runs at about the speed of my earlier test piece, 8 is a lot faster and you get incremental gains up to 64 threads.
Older hardware may not be happy with high thread counts but should handle 16 or 32 threads.
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
include \masm32\include64\masm64rt.inc
.data
flag dq ? ; completion flag
bsiz dq ? ; buffer size
blsz dq ? ; block size
.code
tcnt equ <64> ; <8> <16> <32> <64> <128> <256> ; thread count
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
entry_point proc
USING r12,r13,r14,r15
LOCAL hFile :QWORD
LOCAL bwrt :QWORD
SaveRegs ; save the non-volatile registers listed in USING
mov flag, 0 ; zero the completion flag
mov bsiz, 1024*1024*1024 ; allocated size
mov blsz, 1024*1024*1024/tcnt ; individual block size
conout "Creating random pad",lf
mov r15, alloc(bsiz) ; allocate single block
mov r14, rvcall(GetTickCount) ; start the timing
mov r12, tcnt ; tcnt thread count
mov r13, r15
@@:
invoke CreateThread,0,0,ptr$(Thread),r13,0,0
rcall CloseHandle, rax
add r13, blsz ; set next write location
sub r12, 1
jnz @B
spinlock: ; poll for completion count
pause ; spin-wait hint, eases pressure on the sibling logical core
cmp flag, tcnt
jne spinlock
rcall GetTickCount
sub rax, r14
conout "Timing = ",str$(rax)," ms",lf ; display timing results
conout "Thread Completion Count = ",str$(flag),lf
conout "Saving file to disk",lf
mov bwrt, savefile("test.pad",r15,bsiz) ; write result to disk
exec "ent test.pad"
waitkey "Wait for ENT to complete",lf
RestoreRegs ; restore the non-volatile registers
.exit
entry_point endp
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
NOSTACKFRAME
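Thread proc
mov rdx, blsz ; load the block size (non-zero multiple of 8)
@@:
rdrand rax ; CF = 1 when a valid value is returned
jnc @B ; no value available yet, try again
mov [rcx], rax ; rcx is the buffer address
add rcx, 8
sub rdx, 8
jnz @B
lock add flag, 1 ; atomic increment, plain ADD can lose counts across threads
xor eax, eax ; thread exit code
ret
Thread endp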
STACKFRAME
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
end
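The listing above splits 1 GB by a power-of-two thread count, so every block comes out as a whole number of QWORDs and the `sub rdx, 8` / `jnz` loop ends exactly on zero. For thread counts that do not divide the buffer evenly, the split could be computed along these lines. This is only a sketch in C; `slice_t` and `worker_slice` are hypothetical names, and it assumes the total size is a multiple of 8 bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Byte offset and byte length of the part of the buffer one worker fills.
   Hypothetical helper types and names, for illustration only. */
typedef struct { size_t offset; size_t length; } slice_t;

slice_t worker_slice(size_t total, size_t workers, size_t index)
{
    slice_t s;
    size_t words, per, extra;

    assert(workers > 0 && index < workers);
    assert(total % 8 == 0);               /* the fill loop writes whole QWORDs */

    words = total / 8;                    /* work in 8-byte units */
    per   = words / workers;              /* base share per worker */
    extra = words % workers;              /* the first `extra` workers take one more */

    s.offset = 8 * (index * per + (index < extra ? index : extra));
    s.length = 8 * (per + (index < extra ? 1 : 0));
    return s;
}
```

With 1 GB and 64 workers this gives 16 MB per worker, the same value as `blsz` in the listing; with 80 bytes and 3 workers it gives 32, 24 and 24 bytes.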
-
In this particular scenario, having more threads than the number of physical cores (not logical cores) will probably not bring any real advantage.
For most people that is 4 or 6 physical cores, up to 32 for people with an AMD Threadripper.
However, this is not always the case, in particular when threads stall waiting for a resource.
-
I want to clarify my previous statement, which appears to contradict Hutch's findings. It does not.
When I said that a number of threads equal to the number of physical cores is probably the best we can do in this scenario, there is a question I did not answer. On a CPU with 4 physical cores/8 logical threads, when I launch a program with 4 threads, how can I guarantee that each of them goes to a different physical core? I cannot; some physical core(s) may receive more than 1 thread. Some people have mentioned that in Windows logical and physical cores are interleaved. If this is true, we can play with affinity masks to pin each thread to a different physical core. This might be an interesting exercise. :icon_idea:
In my own experiments with Hutch's last exercise (not using complications like affinity masks, etc.) the best result is achieved with 11 threads, because I have 6 physical cores and the main thread is extremely busy polling. If we don't poll and use WaitForMultipleObjects instead, the best result is achieved with 12 threads.
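The affinity mask idea can be sketched as plain bit arithmetic. The C fragment below assumes, as the post says some people report, that sibling logical processors are numbered next to each other (0 and 1 on core 0, 2 and 3 on core 1, and so on); that numbering is an assumption here, not something verified. `first_sibling_mask` is a hypothetical helper name.

```c
#include <assert.h>
#include <stdint.h>

/* Affinity mask selecting the first logical processor of physical core
   `core`, ASSUMING siblings of the same core have adjacent numbers. */
uint64_t first_sibling_mask(unsigned core, unsigned threads_per_core)
{
    unsigned bit = core * threads_per_core;

    assert(threads_per_core > 0 && bit < 64);   /* one 64-bit processor group */
    return (uint64_t)1 << bit;
}
```

On Windows the resulting value would be handed to SetThreadAffinityMask for each worker thread, and GetLogicalProcessorInformation can be used to check the real topology instead of assuming it.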
-
I don't see any problem in your comment; there are variables in play that are not well documented in processor hardware. It was not all that long ago that each extra thread would add about 90% to the first thread, but the first test I did with 4 threads was 4 times faster than a single thread, and I suggest it is because the hardware is getting smarter and more efficient. I also don't know what the fetch mechanism is with "rdrand" in relation to the preferred instruction set, and while the reference material says the result is derived from thermal noise, it is a more complex instruction than ADD or MOV.
If rdrand is laggy as I suspect, then the thread is not being fully utilised, which explains why you can run so many threads. It is not uncommon with socket communication to have under-utilised threads, which allows you to run many more threads without processor saturation. Sad to say, you learn these things by test piece as the documentation does not really help you here.
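One documented part of rdrand's behaviour is that it reports through the carry flag whether a value was actually delivered (CF = 1) or not (CF = 0), so a caller is expected to retry. A minimal C sketch of a bounded retry loop follows. The source is passed in as a callback so the fragment runs anywhere; on x86-64 a thin wrapper around the `_rdrand64_step` intrinsic could fill that role. `rand64_retry`, `rand_step_fn` and `flaky_step` are hypothetical names, and the retry limit of 10 in the usage line is the figure usually quoted from Intel's guidance, taken here as an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One attempt at drawing a 64-bit value: returns 1 on success, 0 on failure. */
typedef int (*rand_step_fn)(uint64_t *out);

/* Try up to `max_tries` times; returns 1 and stores the value on success. */
int rand64_retry(rand_step_fn step, unsigned max_tries, uint64_t *out)
{
    unsigned i;

    assert(step != NULL && out != NULL);
    for (i = 0; i < max_tries; i++)
        if (step(out))
            return 1;                     /* the source delivered a value */
    return 0;                             /* caller has to handle the failure */
}

/* Stand-in source for demonstration only: fails twice, then succeeds. */
static unsigned flaky_calls;
int flaky_step(uint64_t *out)
{
    if (++flaky_calls <= 2)
        return 0;
    *out = UINT64_C(0x0123456789ABCDEF);
    return 1;
}
```

A call such as `rand64_retry(flaky_step, 10, &value)` succeeds on the third attempt with the stand-in source.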
-
Interesting, maybe I am going to try rdrand vs a random generator snippet.
Doesn't it require some AVX code to fully utilize a thread/core?
-
No, it works on ax, eax and rax or any of the other 64 bit integer registers. The combination I have in mind is rdrand AND a seeded algo combined. :hmmm:
-
What I have in mind is to test rdrand against a packed SIMD random generator.
I haven't tested this yet, and I have not made a dword version or an AVX version yet.
movaps xmm7,seedq
paddw xmm7,r12345678q
pmullw xmm7,seedq
movaps seedq,xmm7
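To make the lane arithmetic of the four instructions above easier to follow, here is a scalar C model of one step, assuming `seedq` and `r12345678q` each hold eight 16-bit words. It is a reading of the snippet as posted, not a tested generator; `packed_step` and `LANES` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 8   /* eight 16-bit lanes in one 128-bit XMM register */

/* One step per lane: seed = (seed + add) * seed, reduced mod 2^16. */
void packed_step(uint16_t seed[LANES], const uint16_t add[LANES])
{
    int i;

    assert(seed != NULL && add != NULL);
    for (i = 0; i < LANES; i++) {
        uint32_t sum = (uint16_t)(seed[i] + add[i]);    /* paddw wraps at 16 bits */
        seed[i] = (uint16_t)(sum * (uint32_t)seed[i]);  /* pmullw keeps the low 16 bits */
    }
}
```

One thing the model makes visible: a lane that ever reaches 0 stays at 0, because the sum is multiplied by the old seed, so lane values would need checking in a real test.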
-
:biggrin:
Let us know when you get it going, producing the seed for AVX will be a ton of fun. You will have to code the random algo yourself as rdrand only works up to QWORD where you need 256 bit data or 128 bit data if you settle for legacy SSE.
-
To generate a good seed we can use RDSEED; it has multiplicative prediction resistance. We can combine 4 of them to get a good 256 bit seed. It is probably better to generate 4 or 8 qword seeds in a row and feed the AVX code with them.
Then we can have a lot of fun mass producing pseudo random numbers with AVX, either 1 at a time, 2, 4, 8, etc at a time. For some variations we need to combine vectors.
-
The only problem with rdseed is that you need very recent hardware to use it. I would be inclined to take multiple rdrand 64 bit results and combine them to get the unpredictability required. At the moment I am working on a 64 bit seeding algo that does not use rdrand or rdseed so it can be used on older hardware.
This is the play version at the moment. It is about as scientific at the moment as twiddling the numbers.
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
include \masm32\include64\masm64rt.inc
.code
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
entry_point proc
USING r12,r13,r14
SaveRegs
mov r12, 32
@@:
rcall seed64
mov r13, rax
mov r14, rax
conout " ",hex$(r13)," - "
conout " ",str$(r14),lf
sub r12, 1
jnz @B
waitkey
RestoreRegs
.exit
entry_point endp
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
NOSTACKFRAME
seed64 proc
rdtsc
pause
bswap rax
mov r11, rax
mov rcx, 10 ; loop count
@@:
rdtsc ; read time stamp counter (low 32 bits in eax)
pause ; spinlock pause
bswap rax ; reverse byte order
rol rax, 7 ; rotate left by prime
xor r11, rax ; xor rax to r11
rol r11, 5
sub rcx, 1 ; decrement counter
jnz @B
mov rax, r11
ret
seed64 endp
STACKFRAME
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
end
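The mixing in seed64 can be modelled in portable C to see exactly what goes in and what comes out. The sketch below assumes the counter's low 32 bits are the only input, since `bswap rax` acts on the zero-extended eax that rdtsc leaves behind. The counter is passed as a callback so the model can be run with fixed values; `seed64_model`, `tick_fn`, `fake_ticks` and `fake_now` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t (*tick_fn)(void);   /* returns the low 32 bits of a counter */

static uint64_t rotl64(uint64_t x, unsigned k)   /* k must be 1..63 */
{
    return (x << k) | (x >> (64 - k));
}

static uint64_t bswap64(uint64_t x)              /* reverse the byte order */
{
    uint64_t r = 0;
    int i;
    for (i = 0; i < 8; i++) { r = (r << 8) | (x & 0xFF); x >>= 8; }
    return r;
}

/* Same sequence as the assembly: bswap, rol 7, xor into accumulator, rol 5. */
uint64_t seed64_model(tick_fn ticks, unsigned rounds)
{
    uint64_t acc;
    unsigned i;

    assert(ticks != NULL);
    acc = bswap64(ticks());
    for (i = 0; i < rounds; i++) {
        uint64_t t = rotl64(bswap64(ticks()), 7);
        acc = rotl64(acc ^ t, 5);
    }
    return acc;
}

/* Deterministic stand-in for RDTSC, for demonstration only. */
uint32_t fake_now = 1;
uint32_t fake_ticks(void) { return fake_now++; }
```

Because every step is a rotation, byte swap or XOR, the output is a fixed reshuffle of successive counter readings, so its unpredictability is no better than that of the timing jitter between those readings.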
-
Linux uses RDRAND together with other sources of entropy and I think RDSEED is being used as well, when available, in more recent kernels.
-
That is pretty much what I had in mind: rdrand for hardware that supports it, an alternative that is probably at least as good for older hardware and, for hardware that supports rdrand, a combination of the two. The combination is slower, but I can't see a method of reproducing a random pad made this way, which is the target for a random pad.
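A common way to combine two sources is to XOR them word by word: as long as one stream is independent of the other and unpredictable, the XOR is no easier to predict than that stream. A minimal C sketch follows, using Marsaglia's xorshift64 as a stand-in for "a seeded algo". It is not the library's irand, and `mix_pad` is a hypothetical name; the pad is assumed to have been filled from the hardware source beforehand.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* xorshift64 step (shift triple 13, 7, 17). The state must be non-zero. */
uint64_t xorshift64(uint64_t *state)
{
    uint64_t x = *state;

    assert(x != 0);
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    return *state = x;
}

/* XOR each word of an existing pad with the next software PRNG word. */
void mix_pad(uint64_t *pad, size_t words, uint64_t seed)
{
    size_t i;
    uint64_t state = seed;

    assert(pad != NULL);
    for (i = 0; i < words; i++)
        pad[i] ^= xorshift64(&state);
}
```

Running `mix_pad` twice with the same seed gives back the original pad, which is a quick way to check that the mixing pass only XORs and never discards data.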
-
This is the first test piece for an rdrand-free pad generator. The surprising thing is that it is far faster than rdrand. It uses the irand algo in the library, and the ENT analysis is at least as good as for the rdrand version. As this test piece is a simple single pass from one seed value, it is not secure against brute forcing, but with a QWORD seed range it will at least make them work for it. :tongue:
To make it secure it needs to be multi seeded and multi passed, but it is fast enough to do that.
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
include \masm32\include64\masm64rt.inc
.data?
seed dq ?
.code
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
entry_point proc
USING r12,r13
LOCAL bcnt :QWORD
LOCAL block :QWORD
LOCAL pMem :QWORD
LOCAL bwrt :QWORD
LOCAL rqst :QWORD
LOCAL tcnt :QWORD
SaveRegs
mov rax, 1024*1024*1024 ; requested pad size
mov rqst, rax
mov rax, rqst
add rax, 1024*64 ; add 64k
mov bcnt, rax ; store it in bcnt
mov block, rax ; store it in block
rcall intdiv,block, 8 ; divide block by 8
mov block, rax
mov pMem, alloc(bcnt) ; allocate the byte count
call reseed ; get a seed for the random algo
rcall seed_irand,rax
conout "Creating Pad",lf
mov tcnt, rv(GetTickCount)
mov r12, block ; block is loop counter
mov r13, pMem ; memory address
@@:
rcall irand ; call random algo
mov QWORD PTR [r13], rax
add r13, 8
sub r12, 1
jnz @B
rcall GetTickCount
sub rax, tcnt
conout "Timing = ",str$(rax)," ms",lf
conout "Saving File",lf
mov bwrt, savefile("random.pad",pMem,rqst)
mfree pMem
conout "Bytes written to disk = ",str$(bwrt),lf,lf
conout "Wait for ENT to complete analysis",lf
exec "ent random.pad"
waitkey
RestoreRegs
.exit
entry_point endp
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
NOSTACKFRAME
reseed proc
push rbx ; cpuid overwrites rbx, which is non-volatile
mov rax, 100 ; out of range number
cpuid ; serialising instruction, overwrites eax ebx ecx edx
rdtsc
pause
bswap rax
mov r11, rax
mov rcx, 10 ; loop count
@@:
rdtsc ; read time stamp counter (low 32 bits in eax)
pause ; spinlock pause
bswap rax ; reverse byte order
rol rax, 7 ; rotate left by prime
xor r11, rax ; xor rax to r11
rol r11, 5 ; rotate left by prime
sub rcx, 1 ; decrement counter
jnz @B
mov rax, r11
pop rbx ; restore rbx for the caller
ret
reseed endp
STACKFRAME
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
end
-
We can also use (presumably) true random values from random.org to seed our pseudo random number generator.
Explanation page:
https://www.random.org/clients/http/
This is the C code to access the site and retrieve values. Although it is very easy to convert to Masm I will not do it at this time:
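#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include <string.h> /* strchr */
#include <conio.h> /* _getch */
#include <Windows.h>
#include <wininet.h>
#pragma comment(lib, "wininet.lib") /* MSVC: link against WinINet */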
#define COUNTVALUES 10
#define MINVAL 0
#define MAXVAL 65535
int main()
{
HINTERNET hInternet, hConnect, hHttpRequest;
char szData[1024] = { 0 };
int intArray[COUNTVALUES] = { 0 };
char* next;
char* curr;
char queryStr[1024] = { 0 };
int i;
sprintf(queryStr, "/integers/?num=%d&min=%d&max=%d&col=1&base=10&format=plain&rnd=new", COUNTVALUES, MINVAL, MAXVAL);
hInternet = InternetOpen("Random.org test", INTERNET_OPEN_TYPE_PRECONFIG, NULL, NULL, 0);
if (hInternet != NULL)
{
hConnect = InternetConnect(hInternet, "www.random.org", INTERNET_DEFAULT_HTTPS_PORT, NULL, NULL, INTERNET_SERVICE_HTTP, 0, 0);
if (hConnect != NULL)
{
if ((hHttpRequest = HttpOpenRequest(hConnect, "GET", queryStr, 0, 0, 0, INTERNET_FLAG_SECURE, 0)))
{
BOOL isSend = HttpSendRequest(hHttpRequest, NULL, 0, NULL, 0);
if (isSend)
{
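DWORD dwTotal = 0; /* bytes accumulated so far */
for (;;)
{
DWORD dwByteRead = 0;
/* append each chunk instead of overwriting the previous one */
BOOL isRead = InternetReadFile(hHttpRequest, szData + dwTotal, (DWORD)(sizeof(szData) - 1 - dwTotal), &dwByteRead);
if (isRead == FALSE || dwByteRead == 0)
break;
dwTotal += dwByteRead;
szData[dwTotal] = 0;
}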
}
}
InternetCloseHandle(hHttpRequest);
}
InternetCloseHandle(hConnect);
}
InternetCloseHandle(hInternet);
//printf("%s\n", szData);
printf("*** %d TRUE RANDOM integer values from RANDOM.ORG in the range %d - %d ***\n\n", COUNTVALUES, MINVAL, MAXVAL);
curr = szData;
for(i=0;i< COUNTVALUES;i++)
{
intArray[i] = atoi(curr);
printf("%d\n", intArray[i]);
next = strchr(curr, '\n');
if (next == NULL)
break;
curr = next + 1;
}
_getch();
return 0;
}
I attach an exe built with the above code.
*** 10 TRUE RANDOM integer values from RANDOM.ORG in the range 0 - 65535 ***
38712
30412
42693
39657
60054
63719
61011
3045
27759
50838
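Values in the range 0 - 65535 carry 16 bits each, so four of them are enough to fill a 64 bit seed and further values can be folded in on top. A minimal C sketch of one way to do that follows, rotating the seed left by 16 and XORing in the next value. `fold_seed16` is a hypothetical helper, not part of the program above, and the folding scheme is an illustrative choice rather than anything prescribed by random.org.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fold integers in the range 0..65535 into one 64-bit seed:
   rotate the seed left by 16 bits, then XOR in the next value. */
uint64_t fold_seed16(const int *values, size_t count)
{
    uint64_t seed = 0;
    size_t i;

    assert(values != NULL);
    for (i = 0; i < count; i++) {
        assert(values[i] >= 0 && values[i] <= 65535);
        seed = (seed << 16) | (seed >> 48);     /* rotate left by 16 bits */
        seed ^= (uint64_t)values[i];
    }
    return seed;
}
```

Passing `intArray` and `COUNTVALUES` from the program above would give a seed that depends on all ten values.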