Modular multiplicative inverse

nidud · September 13, 2019, 10:28:51 AM

deleted

aw27 · September 13, 2019, 04:56:30 PM

Quote from: nidud on September 13, 2019, 10:28:51 AM
why are you here?

I am here mostly to have fun with people that publish performance results but don't show how they got them.

In particular, I would like to see how you used RDTSC (http://www.forwardscattering.org/post/15) to measure the clock cycles

aw27 · September 14, 2019, 02:45:01 AM

Interesting that when I replace your highly optimized polyDivision with my messy polyDivision the performance difference is about 13%. (60 versus 68 cycles, no chance to cook bitcoins

with this difference)
Cycles polynomial Division: 60 (53898/896) versus Cycles polynomial Division: 68 (61178/896)

I am yet to discover where you obtained 100106 cycles for the original. and 63363 cycles for the optimized. but I am going to think the whole weekend about this and may be able to find out. The likely reason might be related to the bear metal (it does happen

).

Yes, I know I have disregarded some practices like flushing the queue to guarantee serial execution of RDTSC, but this was just to illustrate that your conclusions are delirious. Sure, and I believe as well that these tests are mostly meaningless for modern processors but this is another discussion.

Code Select


#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>

extern
void polyDivision(unsigned num, unsigned den, unsigned* polyDivision, unsigned* remPtr);
int polyDivision2(unsigned num, unsigned den, unsigned* polyDivision);
void polyMultiply(unsigned char mult1, unsigned char mult2, unsigned char* res);
unsigned char polyAdd(unsigned char add1, unsigned char add2);

int main()
{
	int i, j;
	int t, newt, r, newr, quotient, remain;
	unsigned char tempByte, tempNew, printValue, tempNewr;
	unsigned long long a, b;
	int counter = 0;
	unsigned long long accumDif = 0;

	for (i = 0; i < 16; i++)
		printf("\t%.2x", i);

	printf("\n\n");

	for (i = 0; i < 16; i++)
	{
		printf("%.2x\t", i);
		for (j = 0; j < 16; j++)
		{
			tempNewr = i * 16 + j;
			newr = tempNewr;
			r = 0x11B;
			t = 0;
			newt = 1;
			if (newr>1)
			{
				while (newr != 1)
				{
					counter++; // NEW
					a=__rdtsc(); // NEW
					polyDivision(r, newr, &quotient, &remain);
					//remain = polyDivision2(r, newr, &quotient); // NEW
					b = __rdtsc(); // NEW
					accumDif += b - a; // NEW
					r = newr;
					newr = remain;
					polyMultiply(quotient, newt, &tempByte);
					tempNew = polyAdd(t, tempByte);
					t = newt;
					newt = tempNew;
				}
			}
			if (tempNewr < 2)
				printValue = tempNewr;
			else
				printValue = tempNew;
			printf("%.2x\t", printValue);

		}
		printf("\n");
	}
	
	printf("Cycles polynomial Division: %d (%lld/%d)", (int) (accumDif / counter), accumDif, counter); // NEW

	return 0;
}

aw27 · September 14, 2019, 06:08:51 PM

And for the GF(2^8) multiply the results are "Cycles GF(2^8) multiply: 105 (94234/896)" for the original and "Cycles GF(2^8) multiply: 90 (80948/896)" for the super-ultra-hyper optimized version from Nidud. A decent, but not impressive, 11.7% improvements.

But I am still waiting for your report explaining where you obtained the 100392 cycles versus 37749 cycles improvement (266% improvement

). I guess I will have to wait a long long time.

Code Select


#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>

extern
void polyDivision(unsigned num, unsigned den, unsigned* polyDivision, unsigned* remPtr);
int polyDivision2(unsigned num, unsigned den, unsigned* polyDivision);
void polyMultiply(unsigned char mult1, unsigned char mult2, unsigned char* res);
unsigned char polyMultiply2(unsigned char mult1, unsigned char mult2);
unsigned char polyAdd(unsigned char add1, unsigned char add2);

int main()
{
	int i, j;
	int t, newt, r, newr, quotient, remain;
	unsigned char tempByte, tempNew, printValue, tempNewr;
	unsigned long long a, b;
	int counter = 0;
	unsigned long long accumDif = 0;

	for (i = 0; i < 16; i++)
		printf("\t%.2x", i);

	printf("\n\n");

	for (i = 0; i < 16; i++)
	{
		printf("%.2x\t", i);
		for (j = 0; j < 16; j++)
		{
			tempNewr = i * 16 + j;
			newr = tempNewr;
			r = 0x11B;
			t = 0;
			newt = 1;
			if (newr>1)
			{
				while (newr != 1)
				{
					counter++;
					//a=__rdtsc();
					polyDivision(r, newr, &quotient, &remain);
					//remain = polyDivision2(r, newr, &quotient);
					//b = __rdtsc();
					//accumDif += b - a;
					r = newr;
					newr = remain;
					a = __rdtsc();
					//polyMultiply(quotient, newt, &tempByte);
					tempByte = polyMultiply2(quotient, newt); //NEW
					b = __rdtsc();
					accumDif += b - a;
					tempNew = polyAdd(t, tempByte);
					t = newt;
					newt = tempNew;
				}
			}
			if (tempNewr < 2)
				printValue = tempNewr;
			else
				printValue = tempNew;
			printf("%.2x\t", printValue);

		}
		printf("\n");
	}
	
	//printf("Cycles polynomial Division: %d (%lld/%d)", (int) (accumDif / counter), accumDif, counter);
	printf("Cycles GF(2^8) multiply: %d (%lld/%d)", (int)(accumDif / counter), accumDif, counter);

	return 0;
}

nidud · September 14, 2019, 09:33:31 PM

deleted

aw27 · September 15, 2019, 12:33:54 AM

It took me more than a reasonable amount of time to build what you attached here because if there is something I dislike in ASMC is that it is not user friendly. It is made by one ultra nerd for other ultra nerds.

Now, your test does not make sense. You picked up a set of 3 ASM instructions, repeated them 50000 times and showed that if we replace that with just one instruction we would save a lot of cycles! Thank you my friend!
But in the whole procedure that does weight much as I have demonstrated. Of course, it is a positive contribution and I would probably find it if I was set to optimize the code.

Have a nice day.

nidud · September 17, 2019, 03:36:49 AM

deleted

aw27 · September 17, 2019, 04:39:29 AM

You are in a state of reality denial and I will, of course, not feed the hallucination.

Quote
It's a masm compatible assembler so it shouldn't be that complicated.
asmc \masm32\m32lib\*.asm
ml -c \masm32\m32lib\*.asm

Try to build what you dropped here like that then.

The 3 *.asm files need to be built to .bin, a trickery that only JWASM descendents have, and the .cmd file is not a file to be run by cmd.exe

it is an ASM file. WTF!

HSE · September 17, 2019, 04:56:21 AM

I don't knew that CMD can make that! Interesting

Here making .bin with UASM32:

Code Select

\nidud\shr>uasm32 -q -bin ?.asm

\nidud\shr>asmc -q -pe "\nidud\shr\timeit.cmd"

\nidud\shr>timeit.exe
AMD A6-3500 APU with Radeon(tm) HD Graphics (SSE3)
----------------------------------------------
-- test(0)
   380926 cycles, rep(1000), code(410) 0.asm: mov al,m | shr al,1 | mov m,al
   353402 cycles, rep(1000), code(160) 1.asm: shr m,1
    50546 cycles, rep(1000), code(113) 2.asm: shr al,1
-- test(1)
   362377 cycles, rep(1000), code(410) 0.asm: mov al,m | shr al,1 | mov m,al
   358679 cycles, rep(1000), code(160) 1.asm: shr m,1
    48543 cycles, rep(1000), code(113) 2.asm: shr al,1
-- test(2)
   351191 cycles, rep(1000), code(410) 0.asm: mov al,m | shr al,1 | mov m,al
   358392 cycles, rep(1000), code(160) 1.asm: shr m,1
    49500 cycles, rep(1000), code(113) 2.asm: shr al,1

total [0 .. 2], 1++
   148589 cycles 2.asm: shr al,1
  1070473 cycles 1.asm: shr m,1
  1094494 cycles 0.asm: mov al,m | shr al,1 | mov m,al
hit any key to continue...

The MASM Forum

News:

Modular multiplicative inverse

nidud

aw27

aw27

aw27

nidud

aw27

nidud

aw27

HSE