Author Topic: Modular multiplicative inverse  (Read 1134 times)

nidud

  • Member
  • *****
  • Posts: 1800
    • https://github.com/nidud/asmc
Re: Modular multiplicative inverse
« Reply #15 on: September 13, 2019, 10:28:51 AM »
At least you understood the mess enough to roll up your sleeves and spend ten minutes looking for a way to save 2 milliseconds.  :thumbsup:

That's one way of looking at it I guess but 100392 versus 37749 cycles is more than 60% reduction in energy consumption: your battery will last 10 days instead of 4. You may increase your bitcoin production by 60% and all the other reasons why people use assembler.

Why do you write assembler code and why are you here?

Quote
Do you mean beer metal?   :rolleyes:

Probably a Russian bear so the metal must be a reference to Stalin.

AW

  • Member
  • *****
  • Posts: 2442
  • Let's Make ASM Great Again!
Re: Modular multiplicative inverse
« Reply #16 on: September 13, 2019, 04:56:30 PM »
why are you here?

I am here mostly to have fun with people that publish performance results but don't show how they got them.

In particular, I would like to see how you used RDTSC (http://www.forwardscattering.org/post/15) to measure the clock cycles    :skrewy:

AW

  • Member
  • *****
  • Posts: 2442
  • Let's Make ASM Great Again!
Re: Modular multiplicative inverse
« Reply #17 on: September 14, 2019, 02:45:01 AM »
 :badgrin:

Interesting that when I replace your highly optimized polyDivision with my messy polyDivision the performance difference is about 13%. (60 versus 68 cycles, no chance to cook bitcoins  :sad: with this difference)
Cycles polynomial Division: 60 (53898/896) versus Cycles polynomial Division: 68 (61178/896)

I am yet to discover where you obtained 100106 cycles for the original. and 63363 cycles for the optimized. but I am going to think the whole weekend about this and may be able to find out. The likely reason might be related to the bear metal (it does happen  :thumbsup:).

Yes, I know I have disregarded some practices like flushing the queue to guarantee serial execution of RDTSC, but this was just to illustrate that your conclusions are delirious. Sure, and I believe as well that these tests are mostly meaningless for modern processors but this is another discussion.

Code: [Select]
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>

extern
void polyDivision(unsigned num, unsigned den, unsigned* polyDivision, unsigned* remPtr);
int polyDivision2(unsigned num, unsigned den, unsigned* polyDivision);
void polyMultiply(unsigned char mult1, unsigned char mult2, unsigned char* res);
unsigned char polyAdd(unsigned char add1, unsigned char add2);

int main()
{
int i, j;
int t, newt, r, newr, quotient, remain;
unsigned char tempByte, tempNew, printValue, tempNewr;
unsigned long long a, b;
int counter = 0;
unsigned long long accumDif = 0;

for (i = 0; i < 16; i++)
printf("\t%.2x", i);

printf("\n\n");

for (i = 0; i < 16; i++)
{
printf("%.2x\t", i);
for (j = 0; j < 16; j++)
{
tempNewr = i * 16 + j;
newr = tempNewr;
r = 0x11B;
t = 0;
newt = 1;
if (newr>1)
{
while (newr != 1)
{
counter++; // NEW
a=__rdtsc(); // NEW
polyDivision(r, newr, &quotient, &remain);
//remain = polyDivision2(r, newr, &quotient); // NEW
b = __rdtsc(); // NEW
accumDif += b - a; // NEW
r = newr;
newr = remain;
polyMultiply(quotient, newt, &tempByte);
tempNew = polyAdd(t, tempByte);
t = newt;
newt = tempNew;
}
}
if (tempNewr < 2)
printValue = tempNewr;
else
printValue = tempNew;
printf("%.2x\t", printValue);

}
printf("\n");
}

printf("Cycles polynomial Division: %d (%lld/%d)", (int) (accumDif / counter), accumDif, counter); // NEW

return 0;
}








AW

  • Member
  • *****
  • Posts: 2442
  • Let's Make ASM Great Again!
Re: Modular multiplicative inverse
« Reply #18 on: September 14, 2019, 06:08:51 PM »
 :skrewy:

And for the GF(2^8) multiply the results are "Cycles GF(2^8) multiply: 105 (94234/896)" for the original and "Cycles GF(2^8) multiply: 90 (80948/896)" for the super-ultra-hyper optimized version from Nidud. A decent, but not impressive, 11.7% improvements.  :greenclp:
But I am still waiting for your report explaining where you obtained the 100392 cycles versus 37749 cycles improvement (266% improvement  :badgrin:). I guess I will have to wait a long long time.  :sad:

Code: [Select]
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>

extern
void polyDivision(unsigned num, unsigned den, unsigned* polyDivision, unsigned* remPtr);
int polyDivision2(unsigned num, unsigned den, unsigned* polyDivision);
void polyMultiply(unsigned char mult1, unsigned char mult2, unsigned char* res);
unsigned char polyMultiply2(unsigned char mult1, unsigned char mult2);
unsigned char polyAdd(unsigned char add1, unsigned char add2);

int main()
{
int i, j;
int t, newt, r, newr, quotient, remain;
unsigned char tempByte, tempNew, printValue, tempNewr;
unsigned long long a, b;
int counter = 0;
unsigned long long accumDif = 0;

for (i = 0; i < 16; i++)
printf("\t%.2x", i);

printf("\n\n");

for (i = 0; i < 16; i++)
{
printf("%.2x\t", i);
for (j = 0; j < 16; j++)
{
tempNewr = i * 16 + j;
newr = tempNewr;
r = 0x11B;
t = 0;
newt = 1;
if (newr>1)
{
while (newr != 1)
{
counter++;
//a=__rdtsc();
polyDivision(r, newr, &quotient, &remain);
//remain = polyDivision2(r, newr, &quotient);
//b = __rdtsc();
//accumDif += b - a;
r = newr;
newr = remain;
a = __rdtsc();
//polyMultiply(quotient, newt, &tempByte);
tempByte = polyMultiply2(quotient, newt); //NEW
b = __rdtsc();
accumDif += b - a;
tempNew = polyAdd(t, tempByte);
t = newt;
newt = tempNew;
}
}
if (tempNewr < 2)
printValue = tempNewr;
else
printValue = tempNew;
printf("%.2x\t", printValue);

}
printf("\n");
}

//printf("Cycles polynomial Division: %d (%lld/%d)", (int) (accumDif / counter), accumDif, counter);
printf("Cycles GF(2^8) multiply: %d (%lld/%d)", (int)(accumDif / counter), accumDif, counter);

return 0;
}



nidud

  • Member
  • *****
  • Posts: 1800
    • https://github.com/nidud/asmc
Re: Modular multiplicative inverse
« Reply #19 on: September 14, 2019, 09:33:31 PM »
 :biggrin:

Still confused I see. The original claim:
Quote
it appears like you converted it from some HLL.

Then you produced this:

        mov al, mult2
        shr al,1
        mov mult2, al

A simple test case:

foo proc
  local tmp:byte
  repeat 50
    mov al,tmp
    shr al,1
    mov tmp,al
  endm
    ret
foo endp
...
  repeat 50
    shr tmp,1
  endm
...
  mov al,tmp
  repeat 50
    shr al,1
  endm

result:

Intel(R) Core(TM) i5-6500T CPU @ 2.50GHz (AVX2)
----------------------------------------------
-- test(0)
   237712 cycles, rep(1000), code(410) 0.asm: mov al,m | shr al,1 | mov m,al
   233321 cycles, rep(1000), code(160) 1.asm: shr m,1
    41915 cycles, rep(1000), code(113) 2.asm: shr al,1
-- test(1)
   236884 cycles, rep(1000), code(410) 0.asm: mov al,m | shr al,1 | mov m,al
   223148 cycles, rep(1000), code(160) 1.asm: shr m,1
    41330 cycles, rep(1000), code(113) 2.asm: shr al,1
-- test(2)
   232043 cycles, rep(1000), code(410) 0.asm: mov al,m | shr al,1 | mov m,al
   223281 cycles, rep(1000), code(160) 1.asm: shr m,1
    40657 cycles, rep(1000), code(113) 2.asm: shr al,1

total [0 .. 2], 1++
   123902 cycles 2.asm: shr al,1
   679750 cycles 1.asm: shr m,1
   706639 cycles 0.asm: mov al,m | shr al,1 | mov m,al

bloat: 410/113
speed: 70/12

Quote
but I am going to think the whole weekend about this and may be able to find out
Well, good luck and have a nice weekend  :thumbsup:

AW

  • Member
  • *****
  • Posts: 2442
  • Let's Make ASM Great Again!
Re: Modular multiplicative inverse
« Reply #20 on: September 15, 2019, 12:33:54 AM »
It took me more than a reasonable amount of time to build what you attached here because if there is something I dislike in ASMC is that it is not user friendly. It is made by one ultra nerd for other ultra nerds.

Now, your test does not make sense. You picked up a set of 3 ASM instructions, repeated them 50000 times and showed that if we replace that with just one instruction we would save a lot of cycles! Thank you my friend!
But in the whole procedure that does weight much as I have demonstrated. Of course, it is a positive contribution and I would probably find it if I was set to optimize the code.

Have a nice day.  :skrewy:

nidud

  • Member
  • *****
  • Posts: 1800
    • https://github.com/nidud/asmc
Re: Modular multiplicative inverse
« Reply #21 on: September 17, 2019, 03:36:49 AM »
It took me more than a reasonable amount of time to build what you attached here

It's just a batch file.

Quote
because if there is something I dislike in ASMC is that it is not user friendly.

It's a masm compatible assembler so it shouldn't be that complicated.

asmc \masm32\m32lib\*.asm
ml -c \masm32\m32lib\*.asm

As for the timing code this is written by MichaelW which is located in \masm32\macros\timers.asm. You will find hundredths of samples of this in the Lab.

Quote
It is made by one ultra nerd for other ultra nerds.

A person with low social skills and a mental capacity of a child but still capable of writing some usable HLL code.

Quote
Now, your test does not make sense.

Hence the question of your presence here. You enter this forum by producing a document claiming it's impossible to beat optimized compiler code. When pressed on this you claim it doesn't matter so why assembler? It doesn't make any sense.

Quote
You picked up a set of 3 ASM instructions, repeated them 50000 times and showed that if we replace that with just one instruction we would save a lot of cycles!

7:1 is 700% and you do this consistently throughout your code so this will obviously have a negative effect on performance.

Quote
But in the whole procedure that does weight much as I have demonstrated.

Demonstrated how exactly? You think your translation performs any better than the original HLL code?

Quote
Of course, it is a positive contribution and I would probably find it if I was set to optimize the code.

So you just added all this bloat because you had some time to kill?

Well, my rant was more about your social skills or lack thereof in this case where you constantly dunk on others for tings you consistently do all the time yourself.

AW

  • Member
  • *****
  • Posts: 2442
  • Let's Make ASM Great Again!
Re: Modular multiplicative inverse
« Reply #22 on: September 17, 2019, 04:39:29 AM »
You are in a state of reality denial and I will, of course, not feed the hallucination.

Quote
It's a masm compatible assembler so it shouldn't be that complicated.
asmc \masm32\m32lib\*.asm
ml -c \masm32\m32lib\*.asm

Try to build what you dropped here like that then.   :badgrin:
The 3 *.asm files need to be built to .bin, a trickery that only JWASM descendents have, and the .cmd file is not a file to be run by cmd.exe  :rolleyes: it is an ASM file. WTF!  :skrewy:

HSE

  • Member
  • *****
  • Posts: 1148
  • <AMD>< 7-32>
Re: Modular multiplicative inverse
« Reply #23 on: September 17, 2019, 04:56:21 AM »
I don't knew that CMD can make that! Interesting :thumbsup:

Here making .bin with UASM32:
Code: [Select]
\nidud\shr>uasm32 -q -bin ?.asm

\nidud\shr>asmc -q -pe "\nidud\shr\timeit.cmd"

\nidud\shr>timeit.exe
AMD A6-3500 APU with Radeon(tm) HD Graphics (SSE3)
----------------------------------------------
-- test(0)
   380926 cycles, rep(1000), code(410) 0.asm: mov al,m | shr al,1 | mov m,al
   353402 cycles, rep(1000), code(160) 1.asm: shr m,1
    50546 cycles, rep(1000), code(113) 2.asm: shr al,1
-- test(1)
   362377 cycles, rep(1000), code(410) 0.asm: mov al,m | shr al,1 | mov m,al
   358679 cycles, rep(1000), code(160) 1.asm: shr m,1
    48543 cycles, rep(1000), code(113) 2.asm: shr al,1
-- test(2)
   351191 cycles, rep(1000), code(410) 0.asm: mov al,m | shr al,1 | mov m,al
   358392 cycles, rep(1000), code(160) 1.asm: shr m,1
    49500 cycles, rep(1000), code(113) 2.asm: shr al,1

total [0 .. 2], 1++
   148589 cycles 2.asm: shr al,1
  1070473 cycles 1.asm: shr m,1
  1094494 cycles 0.asm: mov al,m | shr al,1 | mov m,al
hit any key to continue...