News:

Masm32 SDK description, downloads and other helpful links
Message to All Guests
NB: Posting URL's See here: Posted URL Change

Main Menu

Modular multiplicative inverse

Started by mabdelouahab, September 09, 2019, 07:31:44 AM

Previous topic - Next topic

nidud

#15
deleted

aw27

Quote from: nidud on September 13, 2019, 10:28:51 AM
why are you here?

I am here mostly to have fun with people that publish performance results but don't show how they got them.

In particular, I would like to see how you used RDTSC (http://www.forwardscattering.org/post/15) to measure the clock cycles    :skrewy:

aw27

 :badgrin:

Interesting that when I replace your highly optimized polyDivision with my messy polyDivision the performance difference is about 13%. (60 versus 68 cycles, no chance to cook bitcoins  :sad: with this difference)
Cycles polynomial Division: 60 (53898/896) versus Cycles polynomial Division: 68 (61178/896)

I am yet to discover where you obtained 100106 cycles for the original. and 63363 cycles for the optimized. but I am going to think the whole weekend about this and may be able to find out. The likely reason might be related to the bear metal (it does happen  :thumbsup:).

Yes, I know I have disregarded some practices like flushing the queue to guarantee serial execution of RDTSC, but this was just to illustrate that your conclusions are delirious. Sure, and I believe as well that these tests are mostly meaningless for modern processors but this is another discussion.


#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>

extern
void polyDivision(unsigned num, unsigned den, unsigned* polyDivision, unsigned* remPtr);
int polyDivision2(unsigned num, unsigned den, unsigned* polyDivision);
void polyMultiply(unsigned char mult1, unsigned char mult2, unsigned char* res);
unsigned char polyAdd(unsigned char add1, unsigned char add2);

int main()
{
int i, j;
int t, newt, r, newr, quotient, remain;
unsigned char tempByte, tempNew, printValue, tempNewr;
unsigned long long a, b;
int counter = 0;
unsigned long long accumDif = 0;

for (i = 0; i < 16; i++)
printf("\t%.2x", i);

printf("\n\n");

for (i = 0; i < 16; i++)
{
printf("%.2x\t", i);
for (j = 0; j < 16; j++)
{
tempNewr = i * 16 + j;
newr = tempNewr;
r = 0x11B;
t = 0;
newt = 1;
if (newr>1)
{
while (newr != 1)
{
counter++; // NEW
a=__rdtsc(); // NEW
polyDivision(r, newr, &quotient, &remain);
//remain = polyDivision2(r, newr, &quotient); // NEW
b = __rdtsc(); // NEW
accumDif += b - a; // NEW
r = newr;
newr = remain;
polyMultiply(quotient, newt, &tempByte);
tempNew = polyAdd(t, tempByte);
t = newt;
newt = tempNew;
}
}
if (tempNewr < 2)
printValue = tempNewr;
else
printValue = tempNew;
printf("%.2x\t", printValue);

}
printf("\n");
}

printf("Cycles polynomial Division: %d (%lld/%d)", (int) (accumDif / counter), accumDif, counter); // NEW

return 0;
}









aw27

 :skrewy:

And for the GF(2^8) multiply the results are "Cycles GF(2^8) multiply: 105 (94234/896)" for the original and "Cycles GF(2^8) multiply: 90 (80948/896)" for the super-ultra-hyper optimized version from Nidud. A decent, but not impressive, 11.7% improvements.  :greenclp:
But I am still waiting for your report explaining where you obtained the 100392 cycles versus 37749 cycles improvement (266% improvement  :badgrin:). I guess I will have to wait a long long time.  :sad:


#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>

extern
void polyDivision(unsigned num, unsigned den, unsigned* polyDivision, unsigned* remPtr);
int polyDivision2(unsigned num, unsigned den, unsigned* polyDivision);
void polyMultiply(unsigned char mult1, unsigned char mult2, unsigned char* res);
unsigned char polyMultiply2(unsigned char mult1, unsigned char mult2);
unsigned char polyAdd(unsigned char add1, unsigned char add2);

int main()
{
int i, j;
int t, newt, r, newr, quotient, remain;
unsigned char tempByte, tempNew, printValue, tempNewr;
unsigned long long a, b;
int counter = 0;
unsigned long long accumDif = 0;

for (i = 0; i < 16; i++)
printf("\t%.2x", i);

printf("\n\n");

for (i = 0; i < 16; i++)
{
printf("%.2x\t", i);
for (j = 0; j < 16; j++)
{
tempNewr = i * 16 + j;
newr = tempNewr;
r = 0x11B;
t = 0;
newt = 1;
if (newr>1)
{
while (newr != 1)
{
counter++;
//a=__rdtsc();
polyDivision(r, newr, &quotient, &remain);
//remain = polyDivision2(r, newr, &quotient);
//b = __rdtsc();
//accumDif += b - a;
r = newr;
newr = remain;
a = __rdtsc();
//polyMultiply(quotient, newt, &tempByte);
tempByte = polyMultiply2(quotient, newt); //NEW
b = __rdtsc();
accumDif += b - a;
tempNew = polyAdd(t, tempByte);
t = newt;
newt = tempNew;
}
}
if (tempNewr < 2)
printValue = tempNewr;
else
printValue = tempNew;
printf("%.2x\t", printValue);

}
printf("\n");
}

//printf("Cycles polynomial Division: %d (%lld/%d)", (int) (accumDif / counter), accumDif, counter);
printf("Cycles GF(2^8) multiply: %d (%lld/%d)", (int)(accumDif / counter), accumDif, counter);

return 0;
}




nidud

#19
deleted

aw27

It took me more than a reasonable amount of time to build what you attached here because if there is something I dislike in ASMC is that it is not user friendly. It is made by one ultra nerd for other ultra nerds.

Now, your test does not make sense. You picked up a set of 3 ASM instructions, repeated them 50000 times and showed that if we replace that with just one instruction we would save a lot of cycles! Thank you my friend!
But in the whole procedure that does weight much as I have demonstrated. Of course, it is a positive contribution and I would probably find it if I was set to optimize the code.

Have a nice day.  :skrewy:

nidud

#21
deleted

aw27

You are in a state of reality denial and I will, of course, not feed the hallucination.

Quote
It's a masm compatible assembler so it shouldn't be that complicated.
asmc \masm32\m32lib\*.asm
ml -c \masm32\m32lib\*.asm

Try to build what you dropped here like that then.   :badgrin:
The 3 *.asm files need to be built to .bin, a trickery that only JWASM descendents have, and the .cmd file is not a file to be run by cmd.exe  :rolleyes: it is an ASM file. WTF!  :skrewy:

HSE

I don't knew that CMD can make that! Interesting :thumbsup:

Here making .bin with UASM32:\nidud\shr>uasm32 -q -bin ?.asm

\nidud\shr>asmc -q -pe "\nidud\shr\timeit.cmd"

\nidud\shr>timeit.exe
AMD A6-3500 APU with Radeon(tm) HD Graphics (SSE3)
----------------------------------------------
-- test(0)
   380926 cycles, rep(1000), code(410) 0.asm: mov al,m | shr al,1 | mov m,al
   353402 cycles, rep(1000), code(160) 1.asm: shr m,1
    50546 cycles, rep(1000), code(113) 2.asm: shr al,1
-- test(1)
   362377 cycles, rep(1000), code(410) 0.asm: mov al,m | shr al,1 | mov m,al
   358679 cycles, rep(1000), code(160) 1.asm: shr m,1
    48543 cycles, rep(1000), code(113) 2.asm: shr al,1
-- test(2)
   351191 cycles, rep(1000), code(410) 0.asm: mov al,m | shr al,1 | mov m,al
   358392 cycles, rep(1000), code(160) 1.asm: shr m,1
    49500 cycles, rep(1000), code(113) 2.asm: shr al,1

total [0 .. 2], 1++
   148589 cycles 2.asm: shr al,1
  1070473 cycles 1.asm: shr m,1
  1094494 cycles 0.asm: mov al,m | shr al,1 | mov m,al
hit any key to continue...
Equations in Assembly: SmplMath